PROGRESS REPORT


1.  Introduction 

It  seems  improbable  that  a  single  database  or  operating  system  will  suffice  to  solve 
all  the  application  problems  that  are  likely  to  arise  in  future  real-time,  embedded  systems. 
A much more likely scenario is that future engineers, with support from a programming
environment,  will  select  and  adapt  modules  from  program  libraries.  The  selected 
modules  must  have  proven  operating  characteristics  and  the  domain  over  which  they  are 
applicable  must  be  well-defined. 

The  StarLite  Project,  which  is  supported  by  the  Office  of  Naval  Research,  has  the 
goal  of  constructing  such  a  program  library  for  real-time  applications.  The  initial  focus 
of the project is on operating system and database support.

Another  goal  of  the  StarLite  project  is  to  test  the  hypothesis  that  a  host  prototyping 
environment  can  be  used  to  significantly  accelerate  our  ability  to  perform  experiments  in 
the  areas  of  operating  systems,  databases,  and  network  protocols.  The  primary  project 
requirement  for  StarLite  is  that  software  developed  in  the  prototyping  environment  must 
be  capable  of  being  retargeted  to  different  architectures  only  by  recompiling  and 
replacing  a  few  low-level  modules.  The  anticipated  benefits  are  fast  prototyping  times, 
greater  sharing  of  software  in  the  research  community,  and  the  ability  for  one  research 
group  to  validate  the  claims  of  another  by  replicating  experimental  conditions  exactly. 

As  one  measure  of  the  effectiveness  of  the  environment,  it  is  often  possible  to  fix 
errors  in  the  operating  system,  compile,  and  reboot  the  StarLite  virtual  machine  in  less 
than twenty seconds. The compilation time on a SUN 3/280 for the 66 modules (7500
lines)  that  comprise  the  operating  system  is  one  minute  (clock)  or  16  seconds  (user  time). 
At  the  present  time,  all  components  execute  on  SUN  workstations  using  the  StarLite 
Modula-2  system. 

The  StarLite  prototyping  architecture  is  designed  to  support  the  simultaneous 
execution  of  multiple  operating  systems  in  a  single  address  space.  For  example,  to 
prototype  a  distributed  operating  system,  we  might  want  to  initiate  a  file  server  and 
several  clients.  Each  virtual  machine  would  have  its  own  operating  system  and  user 
processes.  All  of  the  code  and  data  for  all  of  the  virtual  machines  would  be  executed  as  a 
single  UNIX  process. 

In  order  to  support  this  requirement,  we  assume  the  existence  of  high-performance 
workstations  with  large  local  memories.  Ideally,  we  would  prefer  multi-thread  support, 
but  multiprocessor  workstations  are  not  yet  widely  available.  We  also  assume  that 
hardware  details  can  be  isolated  behind  high-level  language  interfaces  to  the  extent  that 
the  majority  of  a  system’s  software  remains  invariant  when  retargeted  from  the  host  to  a 
target  architecture. 

StarLite  has  matured  to  the  point  that  we  are,  in  the  coming  year,  pursuing  several 
technology  transition  possibilities.  We  will  briefly  describe  several  of  our  ideas.  The 
progress  to  date  for  each  of  the  StarLite  components  is  covered  in  later  Sections. 

First,  we  think  that  it  might  be  possible  to  convert  our  whole  environment  over  to  an 
Ada subset. Note the word "subset"; no one in a University could do a full Ada
environment with our funding level. The benefit is that we could exploit
StarLite’s portability to provide a direct technology transfer mechanism for industry.
Second,  we  have  started  work  with  IBM  Manassas  to  combine  our  database  work  with 
the  CMU  work  on  the  ARTS  kernel.  The  database  group  at  NOSC  is  waiting  to  use  the 
resulting system as a testbed. Third, we are attempting to convince CLI, which has a
DARPA contract to formally verify Kernelized Mach, to adopt our modular design
approach so that we could work jointly on the project. This would result in a modular
Kernelized Mach running under the StarLite environment. The advantage is that Mach
could  then  be  widely  distributed  as  a  research  vehicle.  With  more  researchers  working 
on  making  Mach  better,  DOD  and  its  other  users  would  benefit.  Furthermore,  the 
StarLite  Mach  could  serve  as  a  training  device  for  defense  contractors  who  wished  to 
build  their  own  OS  on  top  of  the  Mach  kernel.  Fourth,  some  of  the  people  in  the  43RSS 
development  group  have  agreed  to  work  with  us  to  use  their  Pulse  Detection  System 
specification  as  a  testbed  for  our  operating  system  and  database  work.  We  have  already 
started  the  implementation.  The  result  will  be  a  real  test  case  that  can  be  widely 
distributed  to  other  researchers.  It  will  also  be  a  better  testbed  for  our  new  algorithms 
than  the  random  parameter  ranges  that  we  currently  use. 


2.  Related  Activities 

•  Cook,  General  Chairman,  Seventh  IEEE  Workshop  on  Real-Time  Software  and 
Operating  Systems,  Charlottesville,  VA  (1990). 

•  Son,  participation  in  the  coordination  meeting  with  Prof.  Tokuda  from  CMU  and  Pat 
Watson  from  IBM  (Sept.  1989). 

•  Cook,  participant  in  IDA/ONT/ONR/SNSC  Workshop  on  Operating  Systems  for 
Mission  Critical  Computing  (Sept.  1989). 

•  Cook,  invited  participant  in  AIA/SEI  Workshop  on  Research  Advances  Required  for 
Real-Time  Software  Systems  in  the  ’90s  (Sept.  1989). 

•  Son,  presentation  at  the  18th  International  Conference  on  Parallel  Processing  (Aug. 
1989) 

•  Son,  participation  in  the  IEEE  Data  Engineering  Conference  program  committee 
meeting  (Aug.  1989). 

•  Son,  participation  in  the  ACM  SIGMOD  Conference  (June  1989). 

•  Son,  invited  talk  at  Stanford  University  on  real-time  databases  (June  1989). 

• Cook and Son, accepted for Tools Fair presentation of StarLite at the 11th International
Conference on Software Engineering (May 1989).

•  Cook,  Session  Chair,  Sixth  IEEE  Workshop  on  Real-Time  Software  and  Operating 
Systems,  (May  1989). 

•  Cook,  invited  participant  at  the  IEEE  Indialantic  Workshop  on  Tools  and  Environments 
for  Reuse,  (May  1989). 

• Son, participation in the Carnegie-Mellon University ARTS project meeting (May
1989). 

•  Son,  presentation  at  the  International  Symposium  on  Database  Systems  for  Advanced 
Applications  (April  1989). 

•  Son,  Session  Chair  and  panelist  at  the  International  Symposium  on  Database  Systems 
for  Advanced  Applications  (April  1989). 

•  Son,  presentation  at  the  IEEE  INFOCOM  ’89  (April  1989). 

•  Cook  and  Son,  presentation  at  the  ACM  Conference  on  Hypercube  Concurrent 
Computers  and  Applications  (March  1989). 

•  Son,  invited  talk  at  the  NSWC  on  reliable  distributed  database  systems  (March  1989). 

• Son, participation in the Real-Time Systems Symposium (Dec. 1988).

•  Cook  and  Son,  presentation  at  the  ONR  Foundations  of  Real-Time  Computing 
Research  Initiative  Workshop  (Nov.  1988). 


3.  Student  Participation 

Chun-Hyon  Chang  (Post  Doc.),  priority-based  contention  protocols 
Anthony  Burrell  (Ph.D.  student),  real-time  operating  system  scheduling 
Shi-Chin Chiang (Ph.D. student), checkpointing in distributed database systems
Lee Hsu (Ph.D. student), just getting started
Ying-Feng  Oh  (Ph.D.  student),  just  getting  started 
Juhnyoung  Lee  (Ph.D.  student),  just  getting  started 

Jeremiah  Ratner  (Ph.D.  student),  synchronization  protocols  for  real-time  systems 

Ambar  Sarkar  (Ph.D.  student),  real-time,  fault-tolerant  network  protocols 

David  Duckworth  (M.S.  student),  Modula-2  to  C  compiler 

Greg  Fife  (M.S.  student),  real-time,  distributed,  site  atomic  transactions 

Navid  Haghighi  (M.S.  student),  multi-version  database  performance  evaluation 

Chris Koeritz (M.Sc. student), real-time operating system

Marc  Poris  (M.S.  student),  integration  of  a  database  with  real-time  kernel 

Paul  Shebalin  (M.S.  student),  software  safety  in  real-time  systems 

Alan  Tuten  (M.S.  student),  relational  database  extension 

Prasad  Wagle  (M.S.  student),  temporal  consistency  issues 

Richard McDaniel (B.S. student), prototyping environment


4.  Publications  Since  September  1988 
•  Journal  Publications 

(1)  Cook,  R.  P.,  "An  Empirical  Analysis  of  the  Lilith  Instruction  Set,"  IEEE 
Transactions on Computers 38, 1 (Jan. 1989) 156-158.

(2)  Cook,  R.P.,  "StarMod-A  Language  for  Distributed  Programming,"  reprinted  in 
Concurrent  Programming,  Addison-Wesley,  edited  by  N.  Gehani  and  A.D. 
McGettrick,  (1988). 

(3)  Son,  S.  H.,  "An  Adaptive  Checkpointing  Scheme  for  Distributed  Databases  with 
Mixed  Types  of  Transactions,"  IEEE  Transactions  on  Knowledge  and  Data 
Engineering,  (Dec.  1989),  to  appear. 

(4)  Son,  S.  H.,  "An  Algorithm  for  Non-Interfering  Checkpoints  and  its  Practicality  in 
Distributed  Database  Systems,"  Information  Systems,  (Dec.  1989),  to  appear. 

(5)  Son,  S.  H.  and  A.  Agrawala,  "Distributed  Checkpointing  for  Globally  Consistent 
States  of  Databases,"  IEEE  Transactions  on  Software  Engineering  15,  10(Oct. 
1989)  1157-1167. 

(6) Son, S. H., "Recovery in Main Memory Database Systems for Engineering Design
Applications,"  Information  and  Software  Technology  31,  1  (March  1989)  85-90. 

(7)  Son,  S.  H.,  "Checkpointing  and  Recovery  in  Distributed  Database  Systems,"  Data 
Engineering 12, 1 (March 1989) 44-50.

(8)  Son,  S.  H.,  "An  Algorithm  for  Efficient  Decentralized  Checkpointing,"  Journal  of 
Computer Systems Science and Engineering 4, 1 (Jan. 1989) 27-34.

(9)  Son,  S.  H.,  "Replicated  Data  Management  in  Distributed  Database  Systems,"  ACM 
SIGMOD  Record  17,  4(Dec.  1988)  62-69. 

(10)  Son,  S.  H.,  "Semantic  Information  and  Consistency  in  Distributed  Real-Time 
Systems,"  Information  and  Software  Technology  30,  3(Sept.  1988)  443-449. 


•  Refereed  Conference  Publications 

(11)  Cook,  R.  P.,  "The  StarLite  Operating  System,"  Workshop  on  Operating  Systems 
for  Mission-Critical  Computing,  (Sept.  1989)  J1-J7. 

(12)  Son,  S.  H.  and  N.  Haghighi,  "Performance  Evaluation  of  Multiversion  Database 
Systems,"  Sixth  IEEE  International  Conference  on  Data  Engineering,  Los 
Angeles,  California,  (Feb.  1990),  to  appear. 

(13)  Son,  S.  H.,  "On  Priority-Based  Synchronization  Protocols  for  Distributed  Real- 
Time  Database  Systems,"  IFAC/IFIP  Workshop  on  Distributed  Databases  in 
Real-Time Control, Budapest, Hungary, (Oct. 1989), to appear.

(14)  Son,  S.  H.  and  Y.  Kim,  "A  Software  Prototyping  Environment  and  Its  Use  in 
Developing  a  Multiversion  Distributed  Database  System,"  18th  International 
Conference on Parallel Processing, St. Charles, Illinois, (Aug. 1989) 81-88.

(15)  Son,  S.  H.  and  R.  Cook,  "Scheduling  and  Consistency  in  Real-Time  Database 
Systems,"  Sixth  IEEE  Workshop  on  Real-Time  Operating  Systems  and  Software, 
Pittsburgh,  Pennsylvania,  (May  1989)  42-45. 

(16)  Son,  S.  H.  and  C.  Chang,  "Distributed  Real-Time  Database  Systems:  Prototyping 
and Performance Evaluation," International Symposium on Database Systems for
Advanced  Applications,  Seoul,  Korea,  (April  1989)  251-258. 

(17)  Son,  S.  H.  and  H.  Kang,  "Approaches  to  Design  of  Real-Time  Database  Systems," 
International  Symposium  on  Database  Systems  for  Advanced  Applications,  Seoul, 
Korea,  (April  1989)  274-281. 

(18)  Son,  S.  H.,  "A  Resilient  Replication  Method  in  Distributed  Database  Systems," 
IEEE INFOCOM ’89, Ottawa, Canada, (April 1989) 363-372.

(19) Son, S. H., J. Pfaltz, and J. French, "Synchronization of Replicated Data in Parallel
Database  Systems,"  Fourth  ACM  Conference  on  Hypercube  Concurrent 
Computers  and  Applications,  Monterey,  California,  (March  1989). 

(20) Son, S. H., R. Cook and J. Ratner, "Communication Paradigms for Message-Based
Multicomputer  Systems,"  Fourth  ACM  Conference  on  Hypercube  Concurrent 
Computers  and  Applications,  Monterey,  California,  (March  1989). 

(21)  Pfaltz,  J.,  J.  French,  and  S.  H.  Son,  "Parallel  Set  Operators,"  Fourth  ACM 
Conference  on  Hypercube  Concurrent  Computers  and  Applications,  Monterey, 
California,  (March  1989). 


•  Technical  Reports 

(22) Son, S. H. and J. Ratner, "StarLite: An Environment for Distributed Database
Prototyping," Technical Report TR-89-05, Dept. of Computer Science, University
of  Virginia,  (Aug.  1989). 

(23)  Son,  S.  H.  and  N.  Haghighi,  "Performance  Evaluation  of  Multiversion  Database 
Systems,"  Technical  Report  IPC-TR-89-007,  Institute  for  Parallel  Computation, 
University  of  Virginia,  (July  1989). 

(24)  Son,  S.  H.  and  N.  Haghighi,  "Multiple  Data  Versions  in  Database  Systems," 
Technical Report TR-89-01, Dept. of Computer Science, University of Virginia,
(June  1989). 


5.  The  Prototyping  Environment 

The  components  of  the  environment  include  a  Modula-2  compiler,  a  symbolic 
debugger,  a  window  package,  an  interpreter/runtime  package,  the  Phoenix  operating 
system,  the  concurrency  control  algorithm  testbed,  a  simulation  package,  and 
documentation. 

During the past year, the windows package was extended to support bit-mapped
graphics operations. As a result, we were able to implement a number of support tools
for  profiling,  graphing,  and  visual  simulation.  Also,  the  debugger  was  rewritten  to  be 
window-based  and  mouse-driven.  This  also  involved  changing  the  compiler  so  that 
breakpoints  worked  correctly. 

One  of  the  problems  with  the  environment  is  the  delay  introduced  by  using  an 
interpreter.  This  problem  is  being  addressed  in  two  ways.  First,  we  performed  a  static 
and  dynamic  analysis  of  instruction  opcode  usage  as  a  prerequisite  to  improving  the 
interpreter’s  architecture.  Secondly,  we  think  that  we  have  found  a  way  to  support 
"mixed"  execution;  that  is,  a  program  that  combines  interpreted  code  and  native  machine 
code.  If  our  design  works,  all  of  the  tools  will  continue  to  work  but  users  can  mix  and 
match  machine  language  modules  for  significant  performance  gains.  We  believe  that  this 
goal  can  be  achieved  without  sacrificing  portability. 

As the system has grown larger, it has become more difficult to synchronize
changes  that  propagate  through  multiple  modules.  To  address  this  problem,  we 
implemented  a  simple  "make"  utility  that  automatically  compiles  dependent  modules.  It 
is  simpler  to  use  than  UNIX  "make"  and  avoids  unnecessary  compilations. 
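
The idea behind such a dependency-driven rebuild can be sketched as follows. This is
an illustrative Python sketch under stated assumptions, not the StarLite utility itself:
the function name, the import table, and the two timestamp maps are hypothetical, and
the sketch treats any change to an imported module as forcing recompilation of its
importers, which is coarser than a real Modula-2 tool would need to be.

```python
def modules_to_recompile(imports, source_mtime, object_mtime):
    """Return the modules that must be recompiled, in a valid build order.

    imports:      module -> list of modules it imports (assumed acyclic)
    source_mtime: module -> time the source was last modified (hypothetical input)
    object_mtime: module -> time the object code was last produced
    """
    order, seen = [], set()

    def visit(module):
        # Depth-first walk so that imported modules precede their importers.
        if module in seen:
            return
        seen.add(module)
        for dep in imports.get(module, []):
            visit(dep)
        order.append(module)

    for module in imports:
        visit(module)

    stale = set()
    for module in order:
        out_of_date = object_mtime.get(module, -1) < source_mtime[module]
        dep_rebuilt = any(dep in stale for dep in imports.get(module, []))
        if out_of_date or dep_rebuilt:
            stale.add(module)
    # Modules whose object code is current and whose imports are unchanged are skipped.
    return [module for module in order if module in stale]
```

Only the stale modules are returned, so a change that touches one low-level module
triggers recompilation of its importers and nothing else.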

In  summary,  the  environment  is  designed  to  maximize  productivity.  Therefore,  it 
accelerates a researcher’s ability to conduct experiments, which advances the
state-of-the-art. While the initial version of the environment executes as a single UNIX process,
future  versions  could  take  excellent  advantage  of  both  load  balancing  to  distribute  a 
running  prototype  across  a  number  of  machines  and  of  multiprocessor  support,  such  as  is 
found  in  Mach  or  Taos. 


6.  Operating  System 

During  the  past  year,  the  operating  system  implementation  was  modified  to  execute 
on  the  multiprocessor  machine  model  as  well  as  the  distributed  nodes.  Quite  a  bit  of 
effort was invested in the efficient use of spin locks. As a result, we have invented a new
method  for  handling  synchronization  within  the  operating  system.  The  new  method 
should  decrease  the  cost  of  lock  overhead  dramatically. 

We  also  experimented  with  techniques  to  minimize  interrupt  latency  in  the 
operating system. This effort was successful; it was achieved by isolating the use of
DISABLE to only two modules.

We  also  rewrote  the  SDB  relational  database  system  provided  to  Professor  Son  by 
Pat  Watson  from  IBM  Manassas.  We  call  our  system  RDB  for  Real-time  Database.  Our 
version  corrects  a  number  of  defects  in  SDB.  It  is  reentrant,  can  be  preempted,  supports 
more  flexible  query  processing,  and  it  has  more  data  types  than  SDB. 

We  experimented  with  a  new  dynamic  binding  mechanism  for  operating  system 
services.  The  intent  is  to  make  it  easy  for  application  engineers  to  adapt  the  operating 
system  to  meet  the  requirements  imposed  by  hard  real-time  tasks.  For  example,  they 
might  want  a  file  system  without  naming  to  improve  performance  and  predictability. 

We  experimented  with  and  implemented  a  Volume  Standard  Format.  The  purpose 
of  a  VSF  is  to  make  it  possible  for  multiple  operating  systems  to  share  files  but  without 
sacrificing  their  own  disk  layouts  or  naming  conventions.  When  VSF  is  perfected,  it  will 
be  suitable  for  VLSI  implementation  as  a  national  standard  candidate. 


7.  Database  Systems 

Compared  with  traditional  databases,  real-time  database  systems  have  a  distinct 
feature:  they  must  satisfy  the  timing  constraints  associated  with  transactions.  In  other 
words, "time" is one of the key factors to be considered in real-time database systems.
Transactions must be scheduled in such a way that they can be completed before their
corresponding deadlines expire. For example, both the update and query operations on
the tracking data of a missile must be processed within the given deadlines; otherwise,
the information provided could be of little value. State-of-the-art database systems are
typically  not  used  in  real-time  applications  due  to  two  inadequacies:  poor  performance 
and lack of predictability. Current database systems do not schedule their transactions to
meet response requirements and they commonly lock data tables indiscriminately to
assure  database  consistency.  Locks  and  time-driven  scheduling  are  basically 
incompatible.  Low  priority  transactions  can  and  will  block  higher  priority  transactions 
leading  to  response  requirement  failures.  New  techniques  that  are  compatible  with  time- 
driven  scheduling  and  provide  system  response  predictability  need  to  be  investigated. 

Our  research  effort  during  October  1988  to  September  1989  was  concentrated  in 
three  areas:  investigating  new  techniques  for  real-time  database  systems,  integrating  a 
relational  database  system  with  the  real-time  operating  system  kernel  ARTS,  and 
developing  a  message-based  database  prototyping  environment  for  empirical  study.  In 
addition,  we  have  evaluated  the  performance  of  real-time  database  systems  developed 
using  the  prototyping  environment. 


7.1.  New  Approaches 

We  have  investigated  two  approaches  in  designing  real-time  database  systems.  The 
first  approach  is  to  use  advanced  database  techniques  to  improve  the  availability  and 
responsiveness  of  real-time  database  systems.  Specifically,  we  have  studied  techniques 
for  database  checkpointing  and  synchronization  using  priorities  and  multiple  versions  of 
data.  The  second  approach  is  to  exploit  semantic  information  about  transactions  and  data 
for  intelligent  scheduling.  This  approach,  combined  with  effective  use  of  data  replication, 
may  improve  responsiveness  and  reliability. 

The  need  for  having  checkpoint  mechanisms  in  distributed  database  systems  is  well 
known.  Checkpoints  are  performed  in  database  systems  to  save  a  consistent  state  of  the 
database  on  a  separate  secure  device.  In  case  of  a  failure,  the  stored  data  can  be  used  to 
restore  the  database.  Since  checkpointing  is  performed  during  the  normal  operation  of 
the  system,  interference  with  transaction  processing  must  be  kept  to  a  minimum.  It  is 
highly  desirable  that  users  are  allowed  to  submit  transactions  while  checkpointing  is  in 
progress  and  that  transactions  are  executed  in  the  system  concurrently  with  the 
checkpointing  process.  In  distributed  systems,  this  non-interference  requirement  makes 
checkpointing  complicated  because  we  need  to  consider  coordination  among  autonomous 
sites  of  the  system.  A  quick  recovery  from  failure  is  also  desirable  in  real-time 
applications  of  database  systems  that  require  high  availability.  To  achieve  quick 
recovery,  each  checkpoint  needs  to  be  globally  consistent  so  that  a  simple  restoration  of 
the  latest  checkpoint  can  bring  the  database  to  a  consistent  state.  To  make  each 
checkpoint  globally  consistent,  updates  of  a  transaction  must  be  either  included 
completely  in  one  checkpoint,  or  not  included  at  all. 
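
The all-or-nothing condition above can be stated as a small check. The following
Python sketch is illustrative only: the function name and the representation of a
checkpoint as a set of (transaction, site, item) records are assumptions made for the
example, not part of the checkpointing algorithms discussed in this report.

```python
def is_globally_consistent(checkpoint_updates, transaction_updates):
    """Check that every transaction is wholly inside or wholly outside a checkpoint.

    checkpoint_updates:  set of (txn_id, site, item) records captured by the checkpoint
    transaction_updates: txn_id -> set of (site, item) pairs the transaction wrote
    """
    for txn, writes in transaction_updates.items():
        included = {(site, item)
                    for (t, site, item) in checkpoint_updates if t == txn}
        if included and included != writes:
            # Some, but not all, of this transaction's updates were captured.
            return False
    return True
```

A checkpoint that captured a distributed transaction's update at one site but missed
its update at another site fails the check, and restoring it would expose a
partially applied transaction.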

Recently, non-interfering checkpointing mechanisms, which do
not interfere with transaction processing and achieve global consistency, have been
proposed. They are very promising for real-time database systems. We have investigated
and  extended  the  use  of  non-interfering  and  adaptive  checkpointing  techniques  for 
distributed  real-time  database  systems.  Our  research  effort  has  resulted  in  a  feasible 
solution  for  achieving  the  goals  of  checkpointing.  Currently,  we  are  implementing  a 
non-interfering  checkpointing  algorithm  and  are  using  the  prototyping  environment  to 
evaluate  the  performance  of  our  solution. 

Performance  of  real-time  database  systems  can  be  enhanced  by  synchronization 
using priorities and multiple versions of data. In a real-time database system,
synchronization protocols must not only maintain the consistency constraints of the
database  but  also  satisfy  the  timing  requirements  of  the  transactions  accessing  the 
database.  To  satisfy  both  the  consistency  and  real-time  constraints,  there  is  the  need  to 
integrate  synchronization  protocols  with  real-time  priority  scheduling  protocols.  A  major 
source  of  problems  in  integrating  the  two  protocols  is  the  lack  of  coordination  in  the 
development  of  synchronization  protocols  and  real-time  priority  scheduling  protocols. 
Due  to  the  effect  of  blocking  in  lock-based  synchronization  protocols,  a  direct 
application  of  a  real-time  scheduling  algorithm  to  transactions  may  result  in  a  condition 
known  as  priority  inversion. 

Priority  inversion  is  said  to  occur  when  a  high  priority  process  is  forced  to  wait  for 
an  indefinite  period  of  time  for  the  execution  of  a  lower  priority  process  to  complete. 
Priority  inversion  is  inevitable  in  transaction-based  systems.  However,  to  achieve  a  high 
degree  of  schedulability  in  real-time  applications,  priority  inversion  must  be  minimized. 

We have implemented priority-based scheduling algorithms in our prototyping
environment  and  investigated  technical  issues  associated  with  them.  One  of  the  issues  we 
studied  was  the  use  of  the  priority  ceiling  approach  as  a  basis  for  a  real-time  locking 
protocol  in  a  distributed  environment.  The  priority  ceiling  protocol  might  be 
implemented  in  a  distributed  environment  by  using  the  global  ceiling  manager  at  a 
specific  site. 

In  this  approach,  all  decisions  for  ceiling  blocking  are  performed  by  the  global 
ceiling  manager.  Therefore,  all  the  information  for  the  ceiling  protocol  is  stored  at  the 
site  of  the  global  ceiling  manager.  The  advantage  of  this  approach  is  that  the  temporal 
consistency  of  the  database  is  guaranteed  since  every  data  object  maintains  its  most  up- 
to-date  value.  While  this  approach  ensures  consistency,  holding  locks  across  the  network 
is  not  very  attractive.  Due  to  communication  delay,  locking  across  the  network  will  only 
force  the  processing  of  a  transaction  using  local  data  objects  to  be  delayed  until  access 
requests  to  the  remote  data  objects  are  granted.  This  delay  for  synchronization, 
combined  with  the  low  degree  of  concurrency  due  to  the  strong  restrictions  of  the  priority 
ceiling  protocol,  is  counter-productive  in  real-time  database  systems. 
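
The ceiling-blocking decision made by a global ceiling manager can be sketched as
follows. This is a simplified illustration of the basic priority ceiling rule and
not the protocol implemented in the prototyping environment: the class and method names
are hypothetical, larger numbers are assumed to mean higher priority, a single
exclusive lock per object is assumed (no separate read and write ceilings), and
network messages to the manager are replaced by direct calls.

```python
class GlobalCeilingManager:
    """Single site that makes every ceiling-blocking decision."""

    def __init__(self, ceilings):
        # object -> priority of the highest-priority transaction that may access it
        self.ceilings = dict(ceilings)
        self.holder = {}  # locked object -> id of the transaction holding it

    def request(self, txn, priority, obj):
        """Grant the lock only if the requester's priority is strictly higher
        than the ceiling of every object locked by other transactions."""
        others = [self.ceilings[o] for o, t in self.holder.items() if t != txn]
        if others and priority <= max(others):
            return False  # ceiling blocking: the requester must wait
        self.holder[obj] = txn
        return True

    def release_all(self, txn):
        """Drop every lock held by txn, for example at commit or abort."""
        self.holder = {o: t for o, t in self.holder.items() if t != txn}
```

Note that a request can be refused even when the requested object itself is free.
That restriction is what bounds priority inversion, and it is also the source of the
low degree of concurrency mentioned above.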

An  alternative  to  the  global  ceiling  manager  approach  is  to  have  replicated  copies  of 
data  objects.  An  up-to-date  local  copy  is  used  as  the  primary  copy,  and  remote  copies  are 
used  as  the  secondary  read-only  copies.  In  this  approach,  we  assume  a  single  writer  and 
multiple  readers  model  for  distributed  data  objects.  This  is  a  simple  model  of 
applications  such  as  distributed  tracking  in  which  each  radar  station  maintains  its  view 
and  makes  it  available  to  other  sites  in  the  network.  Currently,  we  are  investigating  the 
trade-offs  between  these  two  approaches  for  distributed  real-time  database  systems  and 
their  performance. 
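
A minimal sketch of the single-writer, multiple-reader model is given below. It is
illustrative only: the `Site` class, its fields, and the synchronous propagation loop
are assumptions made for the example, and a real system would send the updates as
network messages that arrive after some delay.

```python
class Site:
    """A site that owns some data objects and holds read-only copies of others."""

    def __init__(self, name):
        self.name = name
        self.primary = {}    # key -> (value, timestamp); written only by this site
        self.secondary = {}  # (owner, key) -> (value, timestamp); read-only copies
        self.peers = []      # other sites that receive this site's updates

    def write(self, key, value, timestamp):
        # Only the owning site updates the primary copy, so no remote lock is needed.
        self.primary[key] = (value, timestamp)
        for peer in self.peers:
            # Stand-in for an asynchronous update message to each remote site.
            peer.secondary[(self.name, key)] = (value, timestamp)

    def read(self, owner, key):
        # Local objects come from the primary copy, remote ones from the local replica.
        if owner == self.name:
            return self.primary[key]
        return self.secondary[(owner, key)]
```

Reads never wait on the network in this model, but a secondary copy may lag behind
its primary, which is the consistency side of the trade-off under investigation.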

Maintaining  multiple  versions  of  data  objects  is  another  approach  to  improve 
system  responsiveness  by  increasing  the  degree  of  concurrency.  The  objective  of  using 
multiple  versions  is  to  reduce  the  conflict  probability  among  transactions  and  the 
possibility  of  rejection  of  transactions  by  providing  a  succession  of  views  of  data  objects. 
One  of  the  reasons  for  rejecting  a  transaction  is  that  its  operations  cannot  be  provided  by 
the system. For example, a read operation has to be rejected if the value of the data object it
was supposed to read has already been overwritten by some other transaction. Such
rejections  can  be  avoided  by  keeping  old  versions  of  each  data  object  so  that  an 
appropriate  old  value  can  be  given  to  a  tardy  read  operation.  In  a  system  with  multiple 
versions of data, each write operation on a data object produces a new version instead of
overwriting  it.  Hence,  for  each  read  operation,  the  system  is  able  to  select  an  appropriate 
version  to  read  by  flexibly  controlling  the  order  of  read  and  write  operations.  We  have 
investigated several problems that must be solved to effectively use multiple versions of
data in real-time applications. For example, selection of old versions for a given read-only
transaction must ensure the consistency of the state seen by the transaction. In
addition,  the  need  to  save  old  versions  for  read-only  transactions  introduces  a  storage 
management  problem,  i.e.,  methods  to  determine  which  version  is  no  longer  needed  so 
that  it  can  be  discarded. 
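
Version selection and the storage management problem can be illustrated with a small
sketch. The class below is hypothetical and assumes a common timestamp-based scheme:
a read with timestamp ts is served by the newest version written at or before ts, and
versions are discarded once no active reader can still need them.

```python
import bisect


class VersionedObject:
    """A data object that keeps a succession of timestamped versions."""

    def __init__(self):
        self.stamps = []  # write timestamps, kept sorted
        self.values = []  # values[i] is the version written at stamps[i]

    def write(self, ts, value):
        # A write adds a new version instead of overwriting the old one.
        i = bisect.bisect_right(self.stamps, ts)
        self.stamps.insert(i, ts)
        self.values.insert(i, value)

    def read(self, ts):
        # Serve the newest version whose write timestamp does not exceed ts.
        i = bisect.bisect_right(self.stamps, ts)
        if i == 0:
            raise LookupError("no version old enough for this reader")
        return self.values[i - 1]

    def discard_before(self, oldest_active_ts):
        # Keep the version the oldest active reader would see, and all newer ones.
        i = bisect.bisect_right(self.stamps, oldest_active_ts)
        keep_from = max(i - 1, 0)
        del self.stamps[:keep_from]
        del self.values[:keep_from]
```

With this scheme a tardy read is served from an older version rather than rejected,
at the cost of the extra storage that `discard_before` is meant to reclaim.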

Since  multiversion  database  systems  maintain  timing  information  associated  with 
data  objects,  they  can  be  used  to  satisfy  temporal  requirements  of  real-time  transactions. 
The  temporal  consistency  requirement  is  specified  in  terms  of  the  desired  accuracy  of  the 
value  of  data  objects  to  be  read  by  the  transaction.  Temporal  consistency  provides  a  time 
interval,  relative  to  the  start  time  of  a  transaction,  during  which  accurate  states  of  data 
items  may  be  accessed.  For  example,  the  temporal  consistency  requirement  of  15 
indicates  that  the  data  items  accessed  by  the  transaction  cannot  be  older  than  15  time 
units  relative  to  the  start  time  of  the  transaction.  An  attempt  to  read  an  inaccurate  data 
item (i.e., one whose write timestamp is outside of this interval) will cause the transaction
to  abort.  While  a  deadline  can  be  thought  of  as  providing  a  time  interval  as  a  constraint  in 
the  future,  the  temporal  consistency  specifies  a  temporal  window  as  a  constraint  in  the 
past.  We  have  developed  a  real-time  transaction  model  that  can  be  used  for  multiversion 
data objects, and are currently investigating the scheduling options for multiversion real-time
databases.
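
The temporal consistency test described above reduces to a window check on the write
timestamp. The sketch below is illustrative: the function and exception names are
hypothetical, and treating the window as the closed interval from (start time minus
requirement) to the start time is an assumption about boundary cases that the text
does not settle.

```python
class TransactionAborted(Exception):
    """Raised when a transaction reads a data item outside its temporal window."""


def check_temporal_consistency(start_time, requirement, write_timestamp):
    """Return True if the item is accurate enough, otherwise abort the transaction.

    The window reaches `requirement` time units into the past from the start time.
    """
    oldest_allowed = start_time - requirement
    if not (oldest_allowed <= write_timestamp <= start_time):
        raise TransactionAborted(
            f"item written at {write_timestamp} is outside "
            f"[{oldest_allowed}, {start_time}]"
        )
    return True
```

For a transaction that starts at time 100 with a requirement of 15, an item written
at time 90 is acceptable, while an item written at time 84 causes an abort.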


7.2. Integration of a Relational Database with ARTS

ARTS is the real-time operating system kernel being developed by researchers
at Carnegie-Mellon University. The goal of the ARTS OS is to provide a predictable,
analyzable, and reliable distributed real-time computing environment. We have been
working closely with the ARTS developers and Pat Watson at the IBM Federal Systems
Division  to  integrate  a  relational  database  system  with  ARTS.  Our  goal  is  to  provide  a 
fully  functional  distributed  relational  database  manager  for  real-time  systems.  At  present, 
a  relational  database  server  and  client  objects  are  running  on  top  of  ARTS.  We  are 
investigating  methods  to  selectively  apply  consistency  management  techniques  and  to 
develop a multi-thread server for this real-time database manager. In addition, we are
expanding  the  functionalities  that  can  be  provided  by  the  real-time  relational  database 
manager. 


7.3.  Development  of  A  Database  Prototyping  Tool 

One  of  the  primary  reasons  for  the  difficulty  in  successfully  developing  and 
evaluating  new  techniques  for  distributed  database  systems  is  that  it  takes  a  long  time  to 
develop  a  system,  and  evaluation  is  complicated  because  it  involves  analyzing  a  large 
number  of  system  parameters  that  may  change  dynamically.  Prototyping  methods  can  be 
applied  effectively  to  the  evaluation  of  new  techniques  for  implementing  distributed 
database  systems.  By  investigating  design  alternatives  and  performance/reliability 
characteristics of new database techniques, we can provide a clear understanding of
design  alternatives  with  their  costs  and  benefits  in  quantitative  measures.  Furthermore, 
database  technology  can  be  implemented  in  a  modular  reusable  form  to  enhance 
experimentation.  Although  there  exist  tools  for  system  development  and  analysis,  few 
prototyping  tools  exist  for  distributed  database  experimentation,  especially  for  distributed 
real-time  database  systems. 

A  prototyping  tool  to  implement  database  technology  should  be  flexible  and 
organized  in  a  modular  fashion  to  provide  enhanced  experimentation  capability.  A  user 
should  be  able  to  specify  system  configurations  such  as  the  number  of  sites,  network 
topology,  the  number  and  locations  of  processes,  the  number  and  locations  of  resources, 
and  the  interaction  among  processes.  We  use  the  client/server  paradigm  for  process 
interaction  in  our  prototyping  tool.  The  system  consists  of  a  set  of  clients  and  servers, 
which  are  processes  that  cooperate  for  the  purpose  of  transaction  processing.  Each  server 
provides a service to the clients of the system, where a client can request a service by
sending  a  request  message  to  the  corresponding  server. 

We  have  enhanced  the  previous  version  of  the  prototyping  tool  running  on  a  Sun 
workstation.  The  current  prototyping  tool  provides  concurrent  transaction  execution 
facilities,  including  two-phase  locking  and  timestamp  ordering  as  underlying 
synchronization mechanisms. A series of experiments has been performed to evaluate
the performance of multiversion database systems and priority-based synchronization
algorithms. Using the prototyping environment, we found that for specific workloads,
multiversion  database  systems  offer  performance  improvements  despite  the  additional 
CPU  and  I/O  costs  involved  in  accessing  the  old  versions  of  data.  We  have  also  found 
that  transaction  size  is  one  of  the  most  critical  parameters  that  affects  system 
performance.  Some  of  our  findings  have  been  presented  at  the  International  Conference 
on  Parallel  Processing  (August  1989),  and  will  be  presented  at  the  International 
Conference  on  Data  Engineering  (February  1990). 

We have implemented the priority-ceiling protocol and compared its performance
with other design alternatives such as the two-phase locking protocol. We found that
transaction size has little impact on the throughput of the priority-ceiling
protocol over a range of transaction sizes and workload types. In contrast, the
percentage of deadline-missing transactions increases sharply for the two-phase locking
protocol as the transaction size increases. A sharp rise was expected, since the probability
of deadlock goes up with the fourth power of the transaction size. The percentage
of deadline-missing transactions increases much more slowly with transaction size
under the priority-ceiling protocol, since that protocol is deadlock-free
and its response time is proportional to the transaction size and the priority
ranking.
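
The fourth-power relationship can be made concrete with a short calculation. As a sketch, assume only that the deadlock probability P scales with the fourth power of the transaction size s; the symbols P, s, and the constant c are introduced here for illustration and do not come from the experiments:

```latex
P(s) \approx c\,s^{4}
\quad\Longrightarrow\quad
\frac{P(2s)}{P(s)} = \frac{c\,(2s)^{4}}{c\,s^{4}} = 2^{4} = 16
```

Under that assumption, doubling the transaction size makes deadlock roughly sixteen times more likely, which is consistent with the sharp rise reported for two-phase locking.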




APPENDIX 


The  StarLite  Operating  System 


Robert  P.  Cook* 
cook@cs.virginia.edu
Department  of  Computer  Science 
University  of  Virginia 
Charlottesville,  VA  22903 
(804)  979-9943 


1.0  Introduction 

The StarLite project [1,2,3] has four research
components in the areas of prototyping, operating
systems, database, and computer network
technology. The prototyping environment,
which executes on Sun workstations, supports
the development and execution of software for
uni- or multi-processors, as well as distributed
systems.

Figure 1 illustrates the use of the prototyping
environment during a test session for the StarLite
operating system. The figure illustrates our
proprietary UNIX* implementation "booting up"
on a six-node virtual network. Once the virtual
network has booted, the system designer can
execute test programs, collect statistics, or examine
the system state using the built-in debugger,
which is illustrated in Figure 2. We have invested
a good deal of effort in building the prototyping
system to create what we feel is the best possible
research environment.

The purpose of this paper is to describe our
approach to designing a new operating system for
mission-critical computing and to review some
of the technology issues being explored as part of
the StarLite project.

*This work is supported by ONR under contract
N00014-86-K0245 and ARO under contract
DAAL03-87-K0090.

*UNIX is a registered trademark of AT&T Bell
Laboratories.

2.0  Operating  System  Interfaces 

In this section, we describe the interface
requirements that we feel would be most appropriate
for a mission-critical operating system
solution. Interfaces are important because they
can be standardized and because they are designed
to outlive implementations and machine
architectures.

It is now widely accepted that the use of a procedural
interface, such as the C library for UNIX,
is the most advantageous method for presenting
an operating system's functionality to an end
user. Such an interface can be machine and
language invariant. These are desirable properties
given the diversity of hardware/software
used by today's defense contractors.

There are two design options to choose from
as the basis for an interface standard: flat and layered.
An operating system with a flat interface,
such as UNIX, is essentially closed; that is, none
of the interfaces used in the implementation can
be accessed. Flat interfaces are inflexible and
typically trade performance and control for
generality.

A layered interface specification, such as the
ISO OSI definition for computer networks, overcomes
the deficiencies of the traditional, flat
operating system interface designs by allowing
the application engineer to choose an interface
layer that most closely fits the problem to be
solved. For example, if UNIX were a layered
design, it would be possible for a database system
to manipulate the operating system's buffer cache
in a manner that has long been requested by
implementors[4].

Figure 1. A Six-Node StarLite System

Figure 2. The StarLite Symbolic Debugger

Access  to  low-level  interfaces  can  address  the 
performance  requirements  of  mission-critical 
software.  Another  advantage  of  a  layered  design 
is  that  layers  can  be  omitted  to  save  space.  For 
example,  if  an  application  does  not  use  files,  the 
file  system  could  be  omitted.  It  is  also  possible 
to  implement  layers  in  hardware  to  improve 
performance. 

The  StarLite  operating  system  is  based  on  a 
layered  design  with  standard  interfaces.  Two  of 
the  research  issues  are  how  to  partition  the  layers 
and  how  to  define  the  interfaces  at  each  layer. 

To experiment with different options, we
designed and implemented a UNIX-compatible
operating system according to the layering principles
defined by ISO[5]. The StarLite UNIX is
proprietary in that it is not based upon nor does it
contain any code from other UNIX implementations.
We have rewritten the system several times
to try different layering and implementation
strategies.

We have found interface specification to be a
more demanding task than doing the implementation.
In other words, writing a monolithic piece
of code to solve a problem is much easier than
creating a layered design in which the layers are
intended to form functionally complete and useful
subsets.

We have found two other problems with a
layered design that we are addressing as part of
our research. The first problem is the overhead
of procedure calls through multiple layers of
software. The second problem results from
application requirements, such as protection
features, which can affect interfaces and implementations
at a number of layers. Even if the requirement
is removed at a higher layer, there may
be unused procedures and data structures at lower
layers that affect performance. Both problems
are being solved by improving compiler technology.

3.0  Interface  Implementations 

Most operating system implementations are
closed; that is, the user cannot and probably
should not modify them. The StarLite operating
system is designed to support an arbitrary number
of different, validated implementations for a
given interface. As a result, the operating system
as a whole follows an open systems architecture
that can be tuned to meet application requirements.
Examples of different implementation
options for the same interface specification include
CPU and disk scheduling algorithms or
hierarchical versus flat-file name interpretation.

The  long-term  goal  of  the  StarLite  project  is  to 
create  an  operating  system  generator  that  could 
automatically  select  implementations  from  a 
module  library  based  on  specified  application 
requirements  and  a  given  target  architecture.  The 
first  step  toward  achieving  this  goal  is  to  create  a 
library  of  implementation  modules  suitable  for 
mission  critical  applications.  The  current  phase 
of  the  StarLite  project  is  concerned  with  creating 
such  a  library. 

4.0  A  Software  Backplane 

One  of  the  prerequisites  for  experimenting 
with  a  library  of  operating  system  components  is 
having  the  ability  to  add  and  delete  modules  or 
services.  Also,  we  felt  that  some  composition 
mechanism  would  be  necessary  to  achieve  the 
goal  of  creating  an  operating  system  generator. 

This section discusses the two components of
the StarLite operating system that make up what
we call a software backplane. The two components
are a composition strategy for process
objects and a dynamic binding option for system
services. The first component is used to create
the internal structure of an operating system; the
second is used to connect various services to that
underlying structure.

In a traditional operating system implementation,
such as UNIX, the properties of a process are
stored as a single record. Any changes in one
module of UNIX require a change in the ".h" file
for the shared record. The result is that all the
modules in the system must then be recompiled.

The StarLite composition mechanism eliminates
unnecessary recompilations by binding
properties to processes dynamically. The
method is object based but does not support
inheritance. Thus, the support code is small and
fast.

In StarLite, there is only one class of object, a
process. Each process object can be composed of
a limited number of properties that can be connected
to it in any order and at any time.

When the operating system boots up, each
module has been statically linked to the code of
the modules that it depends on. However, each
module dynamically connects its data type to the
process object using a low-level creat system
call. For example, 'creat(">process/TimingInfo")'
would append a set of timing properties
of a certain size to every process object. The
property fields are created only once when the
system boots up. Also at boot time, the modules
that use a particular property retrieve the location
of its fields with an open system call, such as
'open(">process/TimingInfo")'. Again, this only
occurs once. Note that the net effect is the same
as being able to declare a RECORD structure
with the field location bound at runtime.

In order to use the TimingInfo property, a
module must execute a read system call to retrieve
a pointer to or copy of the desired field,
depending on the semantics. The contents of the
field can then be manipulated with the same
efficiency as the traditional UNIX process structure.
For example, "p^.nextTimeOut" would
retrieve a field from the TimingInfo record.
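
The creat/open/read sequence can be illustrated with a small model. This is only a sketch in Python under stated assumptions, not the StarLite implementation: the class PropertyRegistry, the helper new_process, and the slot-list representation of a process object are all hypothetical; only the operation names, the property name, and the nextTimeOut field come from the text.

```python
from dataclasses import dataclass


@dataclass
class TimingInfo:
    """Example property record; the field name follows the text."""
    nextTimeOut: int = 0


class PropertyRegistry:
    """Boot-time table mapping property names to slots in every process object."""

    def __init__(self):
        self._slots = {}      # property name -> slot index
        self._factories = []  # slot index -> constructor for the property record

    def creat(self, name, factory):
        """Append a new property to every process object (done once, at boot)."""
        if name in self._slots:
            raise ValueError(f"property {name!r} already exists")
        self._slots[name] = len(self._factories)
        self._factories.append(factory)
        return self._slots[name]

    def open(self, name):
        """Look up where a property lives (done once per using module, at boot)."""
        return self._slots[name]

    def new_process(self):
        """Build a process object holding one record per registered property."""
        return [factory() for factory in self._factories]


def read(process, slot):
    """Return a reference to the property record, analogous to a pointer."""
    return process[slot]


registry = PropertyRegistry()
registry.creat(">process/TimingInfo", TimingInfo)   # owning module, at boot
timing = registry.open(">process/TimingInfo")       # using module, at boot

p = read(registry.new_process(), timing)
p.nextTimeOut = 250      # plain field access afterwards, like p^.nextTimeOut
```

The name lookup happens once; after that a module holds a slot index and pays only for an indexed access, which is the property the text attributes to run-time-bound record fields.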

It  is  also  possible  to  associate  managers  with 
properties.  When  a  process  object  is  closed,  the 
managers  are  notified  one  at  a  time  so  that  the 
individual fields may be closed. For example, the
exit  system  call's  implementation  is  unaware  that 
there  is  a  file  system  associated  with  a  process. 
When  the  manager  of  the  file-system  property  is 
invoked  as  the  result  of  a  process  exit,  it  does  its 
own  cleanup  by  closing  all  open  files. 
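
A minimal sketch of the manager idea follows, again in Python and under assumptions: the names ProcessObject, attach, OpenFile, and file_system_manager are hypothetical and chosen only to mirror the file-system example in the text.

```python
class ProcessObject:
    """Process object whose properties may each have a manager callback."""

    def __init__(self):
        self.properties = {}   # property name -> property value
        self.managers = {}     # property name -> cleanup callback

    def attach(self, name, value, manager=None):
        self.properties[name] = value
        if manager is not None:
            self.managers[name] = manager

    def close(self):
        """Notify managers one at a time; close() itself knows nothing about files."""
        for name, manager in self.managers.items():
            manager(self.properties[name])


class OpenFile:
    def __init__(self, path):
        self.path = path
        self.is_open = True


def file_system_manager(open_files):
    """Cleanup owned by the file-system module: close every open file."""
    for f in open_files:
        f.is_open = False


proc = ProcessObject()
files = [OpenFile("/tmp/a"), OpenFile("/tmp/b")]
proc.attach("FileSystem", files, manager=file_system_manager)
proc.close()   # models the exit system call closing the process object
```

Because the cleanup lives with the property rather than in the exit path, removing the file system from a configuration removes its cleanup code as well.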

Managers  can  also  be  used  to  monitor  the 
actions  on  fields  for  debugging  purposes.  This  is 
somewhat  equivalent  to  the  probe  points  used  on 
hardware  backplanes. 

The second component of StarLite's software
backplane design is its user services interface.
The operating system acts as an agent between
user interfaces and modules that provide services.
The connection between the two is by
means of messages in which the operations requested
can be open, share, read, write, rcontrol,
wcontrol, and close. However, the interpretation
of the message is strictly up to the
service modules. Thus, the system implementor
can create an arbitrary number of user interfaces
and an arbitrary number of implementations of
those interfaces.

For example, assume that a user opens "/dev/pipe".
The result is that an action procedure is
dynamically associated with the IO field in the
user's process object. Next, an open message is
constructed and sent to the Pipe module. The
return value, which represents two file descriptor
tags for the read/write ends of the pipe, is sent to
the user's process as a result. The applications
engineer can choose from a variety of pipe implementations
by using different names. Note that
dynamic binding need not entail demand loading;
the implementation modules can be loaded
with the boot image if desired.
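
The agent role described above can be sketched as a small dispatcher. The seven operation names are taken from the text; everything else (the Agent and PipeService classes, the bind and send methods, and the integer tags) is a hypothetical illustration in Python rather than the StarLite interface.

```python
OPERATIONS = {"open", "share", "read", "write", "rcontrol", "wcontrol", "close"}


class PipeService:
    """Hypothetical service module; it alone decides what each message means."""

    def __init__(self):
        self.next_tag = 0

    def handle(self, operation, *args):
        if operation == "open":
            read_tag, write_tag = self.next_tag, self.next_tag + 1
            self.next_tag += 2
            return read_tag, write_tag      # tags for the two ends of the pipe
        raise NotImplementedError(operation)


class Agent:
    """Stands in for the operating system: binds names and forwards messages."""

    def __init__(self):
        self.services = {}                  # path name -> service module

    def bind(self, name, service):
        self.services[name] = service

    def send(self, name, operation, *args):
        if operation not in OPERATIONS:
            raise ValueError(f"unknown operation {operation!r}")
        return self.services[name].handle(operation, *args)


agent = Agent()
agent.bind("/dev/pipe", PipeService())      # another name could select another implementation
read_end, write_end = agent.send("/dev/pipe", "open")
```

The agent validates only the operation name; what "open" means is left entirely to the bound service, so alternative pipe implementations can coexist under different names.
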

The user services interface has one other
aspect, the notion of context, that we feel is important
for mission critical computing. A context
defines the mechanism by which names are interpreted.
In the StarLite implementation, all name
resolution is accomplished by messages sent to
context services by means of action procedure
calls.

As a result, any path name syntax and any
effect can be realized. For example, the dynamic
service binding is implemented by a context
module. Contexts can also be used for performance
enhancement. For instance, the standard
UNIX implementation of path name resolution
can result in lengthy and unpredictable disk
accesses. Critical read-only file names could be
resolved by a context so that their index and data
blocks were locked in memory, thereby achieving
unit access times.

We feel that adaptability and extensibility are
desirable properties for operating systems to
support mission critical computing. The traditional
methods of changing interfaces as new
application and technological requirements arise
are unacceptable. StarLite achieves flexibility
without sacrificing performance.

5.0  Technology  Issues 

In this section, we discuss some of the StarLite
project's research in operating system implementation
techniques. The areas discussed include a
Volume Storage Format standard (VSF), synchronization,
and resource allocation.

5.1  A  VSF  standard 

We feel that one of the key aspects of a support
strategy for mission critical computing is a standard
format for disk volumes. The advantages
are that this standard could be implemented in
hardware for high performance and that files
stored on any volume could be accessed by any
operating system.

One way to achieve this goal would be to
dictate the use of a standard file system for all
critical computing. This may not be feasible, so
we have investigated the lesser goal of standardizing
file manipulation, indexing, and disk space
allocation. Each vendor's operating system is
then presented with a standard interface to a volume.

At the current time, the VSF standard is designed
to maintain the integrity of a volume's bit
map, file descriptors and index blocks. It is up to
each operating system to maintain the consistency
of other information, which may be arbitrary.
For example, UNIX information, such as
access times or an owner's id, could be manipulated
freely through the interface. Each operating
system is free to add whatever information it
wants to either file descriptors or index blocks.

This  flexibility  is  achieved  by  partitioning  the 
descriptor  and  index  blocks  into  two  parts.  One 
part  can  be  manipulated  arbitrarily  by  the  host 
operating  system  through  a  protected  interface. 
The second part can only be used in certain
restricted,  but  always  safe,  ways.  The  integrity  of 
the  protected  information,  which  contains  disk 
block  addresses,  guarantees  the  integrity  and 
recoverability  of  a  volume's  data. 

The protected part of an index block or file
descriptor contains index slots. Each index slot
can identify an extent, which can be as small as
one block, or another index block. For high
performance applications, each file can be implemented
as a single extent consisting of a file
descriptor followed by the data. This organization
avoids the overhead associated with the
traditional UNIX implementation.

The design also supports the creation of multilevel
index structures. Since an operating system
can store into the unprotected part of an index
block, it is possible to efficiently implement
keyed access methods, such as B-trees, that do
not "fit" into the UNIX filesystem model. Although
we have not tried it yet, it is also possible
to create indices that span multiple files.
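
The index-slot organization can be pictured with a small data-structure sketch. The Python below is illustrative only: the names Extent, IndexBlock, slots, os_area, and lookup are assumptions, and the actual VSF on-disk layout is not given in the text. It shows a slot holding either an extent or a nested index block, with a separate area left to the host operating system.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Extent:
    start: int      # first disk block of the run
    length: int     # number of contiguous blocks (may be as small as 1)


@dataclass
class IndexBlock:
    # Protected part: slots holding extents or further index blocks.
    slots: List[Union[Extent, "IndexBlock"]] = field(default_factory=list)
    # Unprotected part: whatever the host operating system wants to keep.
    os_area: dict = field(default_factory=dict)

    def size(self):
        """Number of data blocks reachable from this index block."""
        return sum(s.length if isinstance(s, Extent) else s.size()
                   for s in self.slots)

    def lookup(self, logical):
        """Map a logical block number within the file to a disk block address."""
        for slot in self.slots:
            span = slot.length if isinstance(slot, Extent) else slot.size()
            if logical < span:
                if isinstance(slot, Extent):
                    return slot.start + logical
                return slot.lookup(logical)
            logical -= span
        raise IndexError("logical block beyond end of file")


inner = IndexBlock(slots=[Extent(500, 2)])
root = IndexBlock(slots=[Extent(100, 4), inner], os_area={"owner": 42})
address = root.lookup(5)    # falls in the nested index block
```

A file stored as a single extent needs one slot and one addition per lookup, which is the high-performance case the text describes; the nested case models a multilevel index.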


The proposed standard is flexible, supports
volume interchange, and can be used to achieve
predictable, high-performance operation. Proprietary
file systems can still be defined, but low-level
access to data across systems is guaranteed.

5.2  Synchronization 

The StarLite operating system is implemented
using the hierarchy of synchronization
abstractions listed in Figure 3. Operators lower
in the hierarchy have higher performance but
have undesirable side-effects associated with
their use. Disabling interrupts to protect critical
sections is fast (usually one machine instruction)
but its indiscriminate use can increase interrupt
latency times, which in turn can affect critical
event response times. The technique is also
inappropriate for multiprocessors where disabling
interrupts on one processor has no effect
on the execution of the others. The use of
DISABLE in StarLite is restricted to two standard
modules plus any device drivers that implement
device synchronous operations.

As a result, StarLite minimizes interrupt latency.
Furthermore, the fine granularity of locking
supports kernel preemption as well as simultaneous
system or IO operations.


Synchronization Operation     Level

DISABLE/RESTORE               1
Spin Locks                    1
Semaphores                    2
Monitors                      3
Blocking                      4


Figure  3.  Synchronization  Operations 

At the higher layers of the StarLite implementation,
Semaphores are used to protect critical
sections that consist of straight-line code; Monitors
are used for critical sections with blocking
conditions; and Blocking operations are used for
the cases in which a delayed thread can be
swapped out. For swapped, blocked threads, the
unblock operation is reflected as a state change
that defers the wakeup signal until the process is
swapped in and scheduled to run.

In addition to experimenting with fine-grained
locking and synchronization techniques
for operating system construction, we are also
investigating the enhancements necessary to
support real-time. Two areas of interest are
priority inheritance schemes and an integrated
view of criticality.

5.3  Resource  allocation 

Management of resources is one of the most
difficult problems to solve in order to produce a
full-function UNIX operating system that is capable
of providing hard, real-time guarantees.

The problem occurs when a low-priority
process holds a resource requested by a high-priority
process. If the resource cannot be
preempted or released quickly enough, the high-priority
process can miss its deadline. The second
part of the real-time guarantee problem is to
make system timings predictable in the absence
of resource contention.

The  current  StarLite  implementation  attempts 
to  guarantee  that  the  highest-priority  process 
executes  in  an  interference-free  manner  as  long 
as  its  resources  are  disjoint  from  other  processes. 
For  example,  disk  writes  would  circumvent  disk 
scheduling and would supersede other requests.

Our approach to the resource contention problem
is based on priority-ordered avoidance[6].
This technique requires that tasks with "hard"
deadlines submit claims describing future actions
and timing requirements. The system then
guarantees that the deadline will be met as long
as the task does not exceed its computation and
resource limits and neither the hardware nor
the software fails.

Each process with "hard" deadlines must
submit a claim list identifying the resources to be
used and the timing requirements. The system
then associates a data structure with each resource
that restricts access by competing processes
during critical periods. The key to success
is making the avoidance test fast enough, which
is achieved by using priority to totally order the
necessary comparisons.
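
The flavor of a priority-ordered avoidance test can be conveyed with a deliberately simplified sketch. It is not the algorithm of [6] and not the StarLite code: the Claim and Resource classes, the convention that larger numbers mean higher priority, and the tick-based intervals are all assumptions made for illustration in Python. The only idea taken from the text is that claims are kept in priority order so the test can stop early.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    priority: int    # larger number = higher priority (assumption of this sketch)
    start: int       # first tick of the critical period
    end: int         # tick at which the critical period ends (exclusive)


class Resource:
    """Per-resource structure restricting access during claimed periods."""

    def __init__(self):
        self.claims = []     # kept sorted by descending priority

    def add_claim(self, claim):
        self.claims.append(claim)
        self.claims.sort(key=lambda c: c.priority, reverse=True)

    def may_grant(self, priority, start, end):
        """Refuse a request that would overlap a higher-priority claim."""
        for claim in self.claims:
            if claim.priority <= priority:
                break                    # remaining claims cannot block this request
            if start < claim.end and claim.start < end:
                return False
        return True


disk = Resource()
disk.add_claim(Claim(priority=10, start=100, end=120))
disk.add_claim(Claim(priority=3, start=0, end=50))
blocked = disk.may_grant(priority=5, start=90, end=110)   # overlaps the priority-10 claim
allowed = disk.may_grant(priority=5, start=10, end=40)    # only a lower-priority claim overlaps
```

Because the claim list is ordered by priority, the scan ends at the first claim that cannot outrank the requester, so the test examines only the claims that matter.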

6.0  Summary 

StarLite is a research project that is exploring
new ideas for operating system structuring, interface
design, analysis, and implementation. It is
one of the few projects that provides a standard
UNIX interface together with an implementation
strategy that addresses the critical system needs
of high performance, openness, and predictability.
Its layered interface design and software
backplane implementation strategy make the
StarLite system unique. Furthermore, its distribution
as part of the StarLite programming environment
means that any researcher with a SUN
workstation can work with the StarLite design to
make it better.


[1] Cook, R.P., StarLite, A Visual Simulation
Package for Software Prototyping, Second
ACM SIGSOFT/SIGPLAN Symposium
on Practical Software Development Environments,
(Dec. 1986) 102-110, also
SIGPLAN Notices 22, 1 (Jan. 1987).

[2] Cook, R.P., StarMod, A Language for
Distributed Programming, reprinted in
Concurrent Programming, Addison-Wesley,
edited by N. Gehani and A.D.
McGettrick, (1988).

[3] Son, S.H. and R.P. Cook, Scheduling and
Consistency in Real-Time Database
Systems, Sixth IEEE Workshop on Real-Time
Software and Operating Systems,
(May 1989) 42-45.

[4] Gray, J.N., Notes on Database Operating
Systems, in Operating Systems-An
Advanced Course, edited by Bayer, Graham,
Seegmuller, (1979).

[5] Zimmermann, H., OSI Reference Model-
The ISO Model of Architecture for Open
Systems Interconnection, IEEE Transactions
on Communications COM-28,
(April 1980) 425-432.

[6] Munch-Anderson, B. and T.U. Zahle,
Scheduling According to Job Priority With
Prevention of Deadlock and Permanent
Blocking, Acta Informatica 6, 3 (1977)
153-175.


RDB, 

An  Open,  Real-Time,  Relational  Database  Kernel 


Robert P. Cook* and Sang H. Son
Department  of  Computer  Science 
University  of  Virginia 
Charlottesville,  VA  22903 


1.0  Introduction 

In this paper, we discuss the attributes of RDB,
which is an "open", real-time, relational database
kernel. RDB was implemented using the StarLite[1]
software development environment. It
was inspired by the SDB system[2] created by
Betz and Smith.

RDB is intended for use in embedded systems
with requirements for high performance and
real-time priority and predictability guarantees.
RDB is a tool that can be used to achieve these
goals but it is the user's responsibility to use it
properly. For example, RDB is completely reentrant
and can be preempted in one context switch
time to perform an action for a high priority
process. Thus, a query can be interrupted for an
update action and then restarted. However, if the
low priority process holds locks that the high
priority process needs (priority inversion), it is
the user's responsibility to resolve the difficulty.

RDB is an "open"[3] system. It is implemented
as a hierarchy of modules that are structured
so that they can be easily modified or
replaced. Furthermore, RDB does not depend on
any operating system services. As a result, it is
possible to manipulate ROM or memory-resident
databases without having to depend on the
access methods traditionally provided by operating
systems. As a final point, RDB uses upcalls[4]
to implement late binding of query and
I/O operations.

Because RDB gives its user fine-grained
control over its operation, it is simple to select
implementation strategies that achieve performance
and predictability goals. This can be contrasted
with traditional database systems that
operate as closed boxes, often with poor performance
and predictability characteristics.

The  following  sections  describe  the  relational 
model  supported  by  RDB  and  a  simple  example 
that  illustrates  its  use. 

2.0  The  RDB  Model 

The RDB kernel supports the following abstract
data types: Schema, Relation, Attribute,
Cursor, SortKey, SortList, SortLists, Selection,
and Expression. Figure 1 illustrates the relationships
among the various types.

A Schema in RDB describes the tuple format
in terms of the position and type of the Attribute
fields. At present, the only field types are text and
numeric. The schema is disjoint from the data
composing a Relation in order to provide options
for real-time systems that are not normally found
in traditional database systems. For example, it
is possible to define a Relation's content as a file,
but it is also possible to define derived relations.
That is, the tuples making up a relation can be
computed as requested or they can be generated
from a data stream of sensor inputs[5]. Figure 1
illustrates a relation composed from three attributes:
color (text), part number (numeric), and
cost (numeric). Each Attribute specifies the field
name and type as well as its position in the tuple
and its length in bytes. The file consists of 475
tuples.

Figure 1. The RDB DataBase Model
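
As a rough illustration of how a schema that records name, type, position, and length could be used to interpret a tuple, consider the following Python sketch. The attribute names come from the text; the field widths, the ASCII fixed-width encoding of numeric fields, and the names Attribute and decode are assumptions made only for this example.

```python
from dataclasses import dataclass


@dataclass
class Attribute:
    name: str
    kind: str        # "text" or "numeric"
    position: int    # byte offset within the tuple
    length: int      # field length in bytes


def decode(schema, raw):
    """Split one fixed-length tuple into named fields according to the schema."""
    row = {}
    for attr in schema:
        chunk = raw[attr.position:attr.position + attr.length]
        text = chunk.decode("ascii").strip()
        row[attr.name] = int(text) if attr.kind == "numeric" else text
    return row


# Hypothetical widths: 8 bytes of color, 6 of part number, 6 of cost.
schema = [
    Attribute("color", "text", 0, 8),
    Attribute("part number", "numeric", 8, 6),
    Attribute("cost", "numeric", 14, 6),
]
row = decode(schema, b"blue    000417001250")
```

Keeping the schema separate from the bytes is what lets the same description apply to a file, a computed relation, or a sensor stream.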

Once a schema has been defined and a relation
selected, any number of Cursors can be opened.
A cursor identifies a tuple in a relation. Cursors
are used to "mark" the positions at which I/O operations
are to occur in a database. In RDB, the
actual I/O is performed using upcalls. An upcall
is an invocation of a procedure variable that is
bound after RDB is loaded but before a relation
is accessed. The procedures to be invoked are
specified when a relation is "connected" to the
system for I/O.

Upcalls give the system the flexibility to implement
ROM or memory-resident databases,
derived or computed relations, and relations
based on data streams. In essence, each relation
can have a set of access procedures that are tuned
to meet system performance and predictability
goals. The system provides several traditional
access methods, which the user is free to modify
or to augment.
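
The upcall mechanism amounts to late-bound procedure variables, which the following Python sketch tries to show. The names Relation.connect, Cursor.fetch, read_from_memory, and read_squares are hypothetical, as is the convention that an access procedure returns None past the last tuple; RDB's actual procedure signatures are not given in the text.

```python
class Relation:
    """A relation whose I/O is performed through a bound procedure variable."""

    def __init__(self, name):
        self.name = name
        self.read_tuple = None       # upcall, bound by connect()

    def connect(self, read_tuple):
        """Bind the access procedure after loading but before first access."""
        self.read_tuple = read_tuple


class Cursor:
    """Marks the position at which the next I/O operation occurs."""

    def __init__(self, relation):
        self.relation = relation
        self.position = 0

    def fetch(self):
        row = self.relation.read_tuple(self.position)   # the upcall
        if row is not None:
            self.position += 1
        return row


# Memory-resident access method: tuples live in a list.
PARTS = [("blue", 417, 1250), ("red", 112, 300)]


def read_from_memory(position):
    return PARTS[position] if position < len(PARTS) else None


# Derived relation: tuples are computed on request.
def read_squares(position):
    return (position, position * position) if position < 3 else None


parts = Relation("parts")
parts.connect(read_from_memory)
first = Cursor(parts).fetch()

squares = Relation("squares")
squares.connect(read_squares)
cursor = Cursor(squares)
computed = [cursor.fetch() for _ in range(4)]
```

The cursor code is identical for both relations; only the bound procedure differs, which is how one kernel can serve stored, computed, and stream-based relations.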

The SortKey is a central data structure in most
database implementations. It defines the order
relation to be used for an attribute when it is
accessed. For example, the "part number" attribute
in Figure 1 is sorted in descending numeric
order.

A  SortList  identifies  a  relation  and  a  list  of  sort 
keys.  The  keys  represent  the  projection  of  the 
relation  over  which  a  particular  operator  is  to  be 
applied.  If  an  attribute  is  referenced  through  a 
secondary  index,  the  projection  can  sometimes 
be  loaded  by  referring  to  the  index  rather  than 
reading  the  tuples  of  the  original  relation.  For  a 
real-time  system,  the  update  costs  associated 
with  a  secondary  index  must  be  compared  with  its 
efficiency  advantages  for  query  processing. 

A SortLists is a list of SortList elements, where
each SortList is itself a list of SortKeys. The
SortLists data type represents the list of projections
of relations that participate in multi-relation
operations, such as a join.

The join operation is implemented by selection
based on one or more expressions. A Selection
data structure contains the input SortLists
(relations and keys) as well as the upcall procedure
variables that are used to filter the tuples in
the input relations. Filtering can be applied
during selection either when each tuple of a
relation is input or when one tuple in each relation
has been input.

Tuple filtering can be combined with expression
filtering to achieve results that are not possible
in a traditional database system. For example,
any tuple filter has the ability to terminate
selection. As a result, a query that cannot be
completed by its original deadline can return a
partial, or less accurate, result and still meet its
timing constraints.
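
The following Python sketch illustrates how a tuple filter might terminate selection at a deadline. It is an assumption-laden illustration: the `select` loop, the convention that a filter returns False to stop, and the simulated clock are all invented for the example and are not RDB's interface.

```python
def select(tuples, tuple_filter):
    """Feed tuples to the filter; the filter returns False to stop selection."""
    for t in tuples:
        if not tuple_filter(t):
            break

def make_deadline_filter(clock, deadline, results):
    """Build a filter that collects tuples until the deadline is reached."""
    def tuple_filter(t):
        if clock() >= deadline:
            return False  # terminate selection: keep the partial result
        results.append(t)
        return True
    return tuple_filter

# Simulated clock that advances one tick each time it is read.
ticks = iter(range(100))
results = []
select(range(10), make_deadline_filter(lambda: next(ticks), 4, results))
```

With a deadline of 4 ticks, only the first four of the ten input tuples are collected; the query returns that partial result instead of overrunning its deadline.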

The Expression data type is implemented in a
fashion that makes it orthogonal to the rest of the
database kernel. It operates on code strings such
as "e0000 s04blue =", which compares field zero
in relation zero to the string "blue". If they are
equal, the top of the operand stack is set to the
Boolean value TRUE.


As with several other components of RDB, the
Expression module uses an upcall procedure to
bind the interpretation of the "e" operator (load
external). The user typically provides an Expression
with a procedure that will return attribute
values when presented with the arguments to
"load external". However, the Expression module
does not know that it is being used by a
database system.
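
A small Python sketch of a postfix evaluator for code strings of this kind. The encoding is an assumption inferred from the single example in the text: "e" followed by a two-digit relation index and a two-digit field index, "s" followed by a two-digit length and that many characters, and "=" comparing the top two operands. The `evaluate` and `load` names are hypothetical.

```python
def evaluate(code, load_external):
    """Evaluate a postfix code string; return the top of the operand stack."""
    stack, i = [], 0
    while i < len(code):
        op = code[i]
        if op == " ":
            i += 1
        elif op == "e":  # load external: upcall supplied by the user
            rel, fld = int(code[i + 1:i + 3]), int(code[i + 3:i + 5])
            stack.append(load_external(rel, fld))
            i += 5
        elif op == "s":  # string literal with a two-digit length prefix
            n = int(code[i + 1:i + 3])
            stack.append(code[i + 3:i + 3 + n])
            i += 3 + n
        elif op == "=":  # pop two operands, push a Boolean
            right, left = stack.pop(), stack.pop()
            stack.append(left == right)
            i += 1
        else:
            raise ValueError("unknown operator %r at offset %d" % (op, i))
    return stack[-1]

# The evaluator knows nothing about relations; this upcall supplies the values.
current = {0: ["blue", 1042, 250]}

def load(rel, fld):
    return current[rel][fld]

matches_blue = evaluate("e0000 s04blue =", load)
current[0][0] = "red"
matches_red = evaluate("e0000 s04blue =", load)
```

Note that `evaluate` never touches a tuple directly: every external value arrives through `load_external`, so the same evaluator could be driven by something other than a database.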

As a result, the user of RDB is free to generate
the operands of an expression in a manner that is
application dependent. For example, a traditional
database system would lock out updates to
a relation while a query was in progress. Locking
can result in priority inversion, which is anathema
in real-time systems.

With RDB, the selection filters and expression
processing can be specified such that "compensation"
is possible. That is, the updates are
made to the relation and are simultaneously
factored into the query so that neither the query
process nor the update process is delayed. In a
similar fashion, if records are deleted, the effect
may be "subtracted" from a query in progress.

3.0  An  RDB  Example 

The  following  example  illustrates  the  use  of 
RDB and the Phoenix real-time operating system,
which is also part of StarLite, to implement
a  cyclic  process  that  prints  a  "parts"  report  once 
every  hour  starting  at  a  particular  hour. 

Phoenix  provides  an  operator  that  transforms 
a  procedure  into  a  lightweight  thread.  Other 
operators  allow  a  thread  to  set  or  change  its 
priority  and  to  delay  until  a  selected  time  has 
arrived.  Even  though  a  delay  operator  for  a 
relative  amount  of  time  has  the  same  expressive 
power,  we  have  found  that  using  an  absolute  time 
specification  results  in  programs  that  are  more 
likely  to  perform  as  the  user  intended.  This  is 
particularly  true  for  very  fine-grained  timing 
control  operations. 
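
One plausible reason for this observation, which the text does not state, is that a relative delay accumulates the execution time of the loop body as drift, whereas an absolute release time does not. The Python sketch below illustrates that reading on a simulated clock; the function names and parameters are hypothetical.

```python
def release_times_relative(period, work, cycles):
    """Delay(period) after each run: execution time accumulates as drift."""
    now, out = 0, []
    for _ in range(cycles):
        out.append(now)
        now += work    # time spent running the body
        now += period  # relative delay
    return out

def release_times_absolute(period, work, cycles):
    """Delay until start + k*period: release times do not drift."""
    now, next_release, out = 0, 0, []
    for _ in range(cycles):
        out.append(now)
        now += work
        next_release += period
        now = max(now, next_release)  # wait for the absolute release time
    return out

relative = release_times_relative(10, 1, 4)
absolute = release_times_absolute(10, 1, 4)
```

With a period of 10 and a body that takes 1 time unit, the relative version is released at 0, 11, 22, 33, while the absolute version stays on the intended 0, 10, 20, 30 schedule.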

PROCEDURE reportGenerator(initHr: CARDINAL);
BEGIN
  SetPriority(Self(), 7);
  LOOP
    AtSecond.At(initHr*60);
    IF initHr = 24 THEN initHr := 1;
    ELSE initHr := initHr+1;
    END;
    print();
  END;
END reportGenerator;

Figure  2.  A  Cyclic  Report  Generator 

PROCEDURE print();
VAR r     : Relation;
    s     : SortList;
    input : SortLists;
    sel   : Selection;
    e     : Expression;
BEGIN
  r := RFind("partsFile");
  ROpenSort(r, s);
  RAddKey(s, "cost", "ascending");
  RAddKey(s, "color", "");
  ROpenSortLists(input);
  RAddSortList(input, s);
  sel := ROpenSelect(input, e, FIONULL, printr);
  RInterpret.ROpen(e, "e0000 s04blue =", f, sel);
  RSelect(sel);
  RInterpret.Close(e);
  RCloseSelect(sel);
  RCloseSortLists(input);
END print;

Figure  3.  Initiate  Record  Selection 

In  Figure  3,  the  "print"  procedure  associates  a 
Relation  variable  with  the  database  file  and 
schema.  Next,  the  SortLists  is  constructed  to 
describe  the  projections  of  the  input  relations  that 
are  to  be  printed  (in  this  case  just  one  relation). 
Notice that the "cost" and "color" keys are permuted
from the storage order. In general, a
SortList  can  permute  the  keys  for  both  the  input 
and  output  relations. 

After  the  SortLists  is  initialized,  the  Selection 
and  Expression  data  structures  are  initialized. 
One  of  the  arguments  used  to  create  the  "sel" 
variable  is  the  upcall  procedure  "printr"  that 
actually produces the report. One of the arguments
(procedure "f") to create an expression is
an upcall procedure that converts attributes in the
input tuples into expression operands. Figure 4
presents an outline of "f" and "printr".

PROCEDURE f(relation, attr: CARDINAL;
            arg: ADDRESS; VAR (*out*) o: Operand);
(* set Operand to field "attr" in "relation" *)
VAR s   : Selection;
    pT  : pTuple;
    pSK : pSortKey;
BEGIN
  s := arg; (* remembered by Expression *)
  pSK := RGetSKey(s, relation, attr, pT (*out*));
  o.pT := pT;
  o.offset := pSK^.offset;
  o.length := pSK^.length;
END f;

PROCEDURE printr(s: Selection; arg: ADDRESS);
(* filter items to be printed in the report *)
VAR e  : Expression;
    pT : pTuple;
    pO : pOperand;
BEGIN
  e := arg; (* remembered during selection *)
  pO := RInterpret.Evaluate(e);
  IF NOT pO^.b THEN RETURN; END;
  (* FOR EACH SortList in s DO
       pT := RGetSBuf(s, i);
       Write the selected attributes in pT^
  *)
  InOut.WriteLn; (* terminate output line *)
END printr;

Figure 4. Upcall Procedures

When  one  tuple  has  been  read  for  each  relation 
selected as input, the "printr" procedure is invoked.
This procedure in turn evaluates an
expression  to  select  the  tuples  to  be  printed  in  the 
report. 

Whenever the expression evaluator encounters
the "Load External" ("e") operator, it performs
an upcall to the "f" procedure that was
passed as an argument to ROpen to create an
Expression variable. One of the arguments to "f" is
the address of an operand descriptor. It is the
procedure's responsibility to map the relation and
attribute indices to a SortKey. The SortKey is
then used to retrieve an attribute value, which is
passed back to the evaluator as an operand.

When the evaluator completes, it returns an
operand descriptor for the value on the top of the
operand stack. For the "printr" procedure, the
result is a Boolean value. If it is true, the fields
are printed in the report. If it is false, the fields
are ignored and selection continues.

RDB implements a number of very flexible
options for expression evaluation that space does
not permit us to describe. For real-time systems,
the two most important are expression preemption
and contextual reevaluation. In the former, any
expression can be preempted at any time by more
critical expressions or other system actions. Using
contextual reevaluation, it is possible for a query
to modify its expression while it is being evaluated
to return a less accurate result in order to meet
timing constraints.

RDB does not currently support transactions,
locking, or recovery. It can, however, operate on
either local relations or remote files by using
RPC. We are implementing additional functionality
as part of a layered design for database
operations. The layers are implemented so that
the end-user can add or subtract features to meet
the performance or timing requirements of
embedded systems.

4.0  Summary 

The  RDB  database  kernel  is  intended  for  use 
in stand-alone applications that have "hard" timing
requirements. It is an "open" system in that
the  user  can  manipulate  interfaces  not  normally 
available  in  traditional  database  systems  to 
"tune"  performance  to  application  requirements. 

The  use  of  upcalls  also  adds  great  flexibility  to 
both selection and expression processing options.

1. Introduction

As computers are becoming an essential part of
real-time systems, real-time computing is emerging as
an important discipline in computer science and
engineering [Shin87]. The growing importance of
real-time computing in a large number of applications,
such as aerospace and defense systems, industrial
automation, and nuclear reactor control, has resulted in an
increased research effort in this area. Since any kind of
computing needs to access data, methods for designing
and implementing database systems that satisfy
timing constraints in collecting, updating,
and retrieving data play an important role in the
success of real-time systems.

Researchers working on developing new real-time
systems based on distributed system architecture
have found that database managers are assuming
much greater importance in real-time systems [Son88].
One of the characteristics of current database managers
is that they do not schedule their transactions to meet
response requirements and they commonly lock data
tables indiscriminately to assure database consistency.
Locks and time-driven scheduling are basically
incompatible. Low priority transactions can and will block
higher priority transactions, leading to response
requirement failures. New techniques are required to manage
database consistency which are compatible with
time-driven scheduling and the essential system response
predictability/analyzability it brings. One of the
primary reasons for the difficulty in successfully developing
and evaluating a distributed database system is that
it takes a long time to develop a system, and evaluation
is complicated because it involves a large number of
system parameters that may change dynamically.

A prototyping technique can be applied effectively
to the evaluation of control mechanisms for
distributed database systems. A database prototyping tool
is a software package that supports the investigation of
the properties of database control techniques in an
environment other than that of the target database
system. An environment that provides prototyping tools
has several advantages. First, it is cost effective.
If experiments for a twenty-node distributed database
system can be executed in a software environment, it is
not necessary to purchase a twenty-node distributed
system, reducing the cost of evaluating design
alternatives. Second, design alternatives can be evaluated in a
uniform environment with the same system parameters,
making a fair comparison possible. Finally, as technology
changes, the environment need only be updated to
provide researchers with the ability to perform new
experiments.

A prototyping environment can reduce the time
needed to evaluate new technologies and design alternatives.
From our past experience, we assume that a relatively
small portion of a typical database system's code is
affected by changes in specific control mechanisms,
while the majority of code deals with intrinsic
problems, such as file management. Thus, by properly
isolating technology-dependent portions of a database
system using modular programming techniques, we can
implement and evaluate design alternatives very
rapidly. Although there exist tools for system
development and analysis, few prototyping tools exist for
distributed database experimentation. Especially if the
system designer must deal with message-passing
protocols and timing constraints, it is essential to have
an appropriate prototyping environment for success in
the design and analysis tasks.

This paper describes a message-based approach
to the prototyping study of distributed real-time database
systems, and presents prototyping software
implemented for a series of experiments to evaluate
priority-based synchronization algorithms.

2.  Structure  of  the  Prototyping  Environment 

For a prototyping tool for distributed database
systems to be effective, appropriate operating system
support is mandatory. Database control mechanisms
need to be integrated with the operating system,
because the correct functioning of control algorithms
depends on the services of the underlying operating
system; an integrated design also reduces the
significant overhead of a layered approach during
execution.

Although an integrated approach is desirable, the
system needs to support flexibility which may not be
possible in an integrated approach. In this regard, the
concept of developing a library of modules with
different performance and reliability characteristics for an
operating system as well as database control functions
seems promising. Our prototyping environment
follows this approach [Cook87, Son88b]. It is designed as
a modular, message-passing system to support easy
extensions and modifications. An instance of the
prototyping environment can manage any number of virtual
sites specified by the user. Modules that implement
transaction processing are decomposed into several
server processes, and they communicate among
themselves through ports. The clean interface between
server processes simplifies incorporating new
algorithms and facilities into the prototyping environment,
or testing alternate implementations of algorithms.

The User Interface (UI) is a front-end invoked when
the prototyping environment begins. UI is menu-driven,
and designed to be flexible in allowing users to
experiment with various configurations and different
system parameters. A user can specify the following:

•  system configuration: number of sites and the
number of server processes at each site.

•  database configuration: database at each site with
user defined structure, size, granularity, and levels of
replication.

•  load characteristics: number of transactions to be
executed, size of their read-sets and write-sets,
transaction types (read-only or update) and their
priorities, and the mean interarrival time of transactions.

•  concurrency control: locking, timestamp ordering,
and priority-based.


UI initiates the Configuration Manager (CM),
which initializes the data structures necessary for
transaction processing based on the user specification. CM
invokes the Transaction Generator at appropriate
time intervals to generate the next transaction, so that
transaction arrivals form a Poisson process.
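
A Poisson arrival process is commonly produced by drawing exponentially distributed interarrival times. The Python sketch below shows that standard construction; it is not the Transaction Generator itself, and the `arrival_times` name, its parameters, and the fixed seed are assumptions for the example.

```python
import random

def arrival_times(mean_interarrival, count, seed=0):
    """Return `count` arrival instants of a Poisson process.

    Interarrival gaps are drawn from an exponential distribution whose
    mean is `mean_interarrival`; a fixed seed makes runs repeatable.
    """
    rng = random.Random(seed)
    t, out = 0.0, []
    for _ in range(count):
        t += rng.expovariate(1.0 / mean_interarrival)  # rate = 1 / mean
        out.append(t)
    return out

times = arrival_times(2.0, 10000)
observed_mean = times[-1] / len(times)
```

Seeding the generator is one way to let two experiments see the same arrival sequence, which matters when different protocols are to be compared under identical load.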

Transaction execution consists of read and write
operations. Each read or write operation is preceded by
an access request sent to the Resource Manager, which
maintains the local database at each site. Each
transaction is assigned to the Transaction Manager (TM). TM
issues service requests on behalf of the transaction and
reacts appropriately to the request replies.

The Performance Monitor interacts with the
transaction managers to record the priority/timestamp and
read/write data set of each transaction, the time at which each
event occurred, statistics for each transaction, and the CPU
hold interval at each node. The statistics for a
transaction include arrival time, start time, total processing
time, blocked interval, whether the deadline was missed,
and the number of aborts.

3.  Priority-Based  Synchronization 

In a real-time database system, synchronization
protocols must not only maintain the consistency
constraints of the database but also satisfy the timing
requirements of the transactions accessing the database.
To satisfy both the consistency and real-time
constraints, synchronization protocols need to be integrated
with real-time priority scheduling protocols.
A major source of problems in integrating the two
protocols is the lack of coordination in the development of
synchronization protocols and real-time priority
scheduling protocols. Due to the effect of blocking in
lock-based synchronization protocols, a direct
application of a real-time scheduling algorithm to transactions
may result in a condition known as priority inversion.
Priority inversion is said to occur when a higher
priority process is forced to wait for the execution of a
lower priority process for an indefinite period of time.
When the transactions of two processes attempt to
access the same data object, the access must be
serialized to maintain consistency. If the transaction of the
higher priority process gains access first, then the
proper priority order is maintained; however, if the
transaction of the lower priority process gains access first and
then the higher priority transaction requests access to
the data object, this higher priority process will be
blocked until the lower priority transaction completes
its access to the data object. Priority inversion is
inevitable in transaction systems. However, to achieve a high
degree of schedulability in real-time applications,
priority inversion must be minimized. This is illustrated
by the following example.


Example: Suppose T1, T2, and T3 are three
transactions arranged in descending order of priority, with
T1 having the highest priority. Assume that T1 and T3
access the same data object O1. Suppose that at time t1
transaction T3 obtains a lock on O1. During the
execution of T3, the high priority transaction T1 arrives,
preempts T3, and later attempts to access the object O1.
Transaction T1 will be blocked, since O1 is already
locked. We would expect that T1, being the highest
priority transaction, will be blocked no longer than the
time for transaction T3 to complete and unlock O1.
However, the duration of blocking may, in fact, be
unpredictable. This is because transaction T3 can be
blocked by the intermediate priority transaction T2 that
does not need to access O1. The blocking of T3, and
hence that of T1, will continue until T2 and any other
pending intermediate priority level transactions are
completed.

The blocking duration in the example above can
be arbitrarily long. This situation can be partially
remedied if transactions are not allowed to be
preempted; however, this solution is only appropriate
for very short transactions, because it creates
unnecessary blocking. For instance, once a long low priority
transaction starts execution, a high priority transaction
not requiring access to the same set of data objects may
be needlessly blocked.

An approach to this problem, based on the notion
of priority inheritance, has been proposed [Sha87]. The
basic idea of priority inheritance is that when a
transaction T of a process blocks higher priority processes, it
executes at the highest priority of all the transactions
blocked by T. This simple idea of priority inheritance
reduces the blocking time of a higher priority
transaction. However, this is inadequate because the blocking
duration for a transaction, though bounded, can still be
substantial due to the potential chain of blocking. For
instance, suppose that transaction T1 needs to
sequentially access objects O1 and O2. Also suppose that T2
preempts T3, which has already locked O2. Then, T2
locks O1. Transaction T1 arrives at this instant and
finds that the objects O1 and O2 have been
respectively locked by the lower priority transactions T2 and
T3. As a result, T1 would be blocked for the duration
of two transactions, once to wait for T2 to release O1
and again to wait for T3 to release O2. Thus a chain of
blocking can be formed.
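
The effect of priority inheritance on the first example (T3 holds the lock on O1, T1 needs it, T2 does not) can be illustrated with a small tick-by-tick simulation in Python. This is a sketch under stated assumptions, not the prototyping environment's scheduler: the arrival times, the program lengths, the one-character-per-tick encoding, and the `simulate` function are all invented for the illustration.

```python
def simulate(tasks, inherit):
    """Tick-by-tick fixed-priority schedule with a single lock on O1.

    tasks maps name -> (priority, arrival, program); a larger number is a
    higher priority. A program has one character per tick: 'c' computes,
    'L' runs while holding the lock. Returns completion times by name.
    """
    pc = {name: 0 for name in tasks}  # index of each task's next tick
    holder, finish, ready, t = None, {}, [], 0

    def blocked(name):
        return tasks[name][2][pc[name]] == "L" and holder not in (None, name)

    def priority(name):
        base = tasks[name][0]
        if inherit and name == holder:
            # The lock holder runs at the highest priority it is blocking.
            return max([base] + [tasks[m][0] for m in ready if blocked(m)])
        return base

    while len(finish) < len(tasks):
        ready = [n for n in tasks if tasks[n][1] <= t and n not in finish]
        runnable = [n for n in ready if not blocked(n)]
        if runnable:
            n = max(runnable, key=priority)
            program = tasks[n][2]
            if program[pc[n]] == "L":
                holder = n  # acquire (or keep) the lock
            pc[n] += 1
            done = pc[n] == len(program)
            if holder == n and (done or program[pc[n]] != "L"):
                holder = None  # release after the last 'L' tick
            if done:
                finish[n] = t + 1
        t += 1
    return finish

TASKS = {
    "T3": (1, 0, "LLLL"),   # low priority, locks O1 at time 0
    "T1": (3, 1, "cLc"),    # high priority, needs O1 on its second tick
    "T2": (2, 2, "ccccc"),  # intermediate priority, never touches O1
}
plain = simulate(TASKS, inherit=False)
inherited = simulate(TASKS, inherit=True)
```

Under these assumed parameters T1 completes at time 12 without inheritance, because T2 runs ahead of the lock holder T3, and at time 7 with inheritance, because T3 finishes its locked section at T1's priority before T2 is scheduled.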

Several methods to combat this inadequacy are
under investigation. The priority ceiling protocol is one
such method, being investigated at Carnegie-Mellon
University [Sha88]. It aims not only to limit
the blocking time of a transaction to at
most one lower priority transaction execution time, but
also to prevent the formation of deadlocks. The
priority ceiling protocol has been implemented in our
real-time database system and compared with other
synchronization protocols using the prototyping
environment.

Using the prototyping tool, we have been
evaluating the priority ceiling protocol and
investigating technical issues associated with priority-based
scheduling protocols. One of the issues we are studying
is the comparison of the priority ceiling protocol with
other design alternatives. In our experiments, all
transactions are assumed to be hard in the sense that there
will be no value in completing a transaction after its
deadline. Transactions that miss the deadline are
aborted, and disappear from the system immediately
with some abort cost.

4.  Priority  Ceiling  Protocol 

The priority ceiling protocol is premised on
systems with a fixed priority scheme. The protocol
consists of two mechanisms: priority inheritance and
priority ceiling. We have already explained the
priority inheritance mechanism. In the priority ceiling
mechanism, a priority ceiling is defined for every data
object as the priority of the highest priority transaction
which may access the data object. A transaction is
allowed to access the data object only if its priority is
higher than the priority ceilings of all data objects
currently being accessed by some transaction in the
system. With the combination of these two
mechanisms, it has been shown that in the worst case, each
transaction has to wait for at most one lower priority
transaction in its execution, and no deadlock will ever
occur [Sha88]. In the next example, we show how
transactions are scheduled under the priority ceiling
protocol.

Example: Consider the same situation as in the
previous example. According to the protocol, the
priority ceiling of O1 is the priority of T1. When T2
tries to access a data object, it is blocked because its
priority is not higher than the priority ceiling of O1.
Therefore T1 will be blocked only once by T3 to enter
O1, regardless of the number of data objects it may
access.

The ceiling manager implements the priority
ceiling algorithm in the prototyping environment. The
lock on a data object can be either a read-lock or a
write-lock. The write-priority ceiling of a data object is
defined as the priority of the highest priority
transaction that may write into this object, and the
absolute-priority ceiling is defined as the priority of the highest
priority transaction that may read or write the data
object. When a data object is write-locked
(read-locked), the rw-priority ceiling of this data object is
defined to be equal to the absolute (write) priority
ceiling.


When a transaction attempts to lock a data
object, the transaction's priority is compared with the
highest rw-priority ceiling of all data items currently
locked by other transactions. If the priority of the
transaction is not higher than the rw-priority ceiling, the
request is denied. Otherwise, it is granted the lock. In the
denied case, priority inheritance is performed in
order to overcome the problem of uncontrolled priority
inversion.
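
The lock test described above can be sketched in Python as follows. The sketch covers only the grant-or-deny decision and omits the inheritance step; the data structures, function names, priorities, and access sets are assumptions chosen for illustration and are not the ceiling manager's actual interface.

```python
def rw_ceiling(obj, mode, write_ceiling, absolute_ceiling):
    """rw-priority ceiling of a locked object: the absolute ceiling if it
    is write-locked, the write ceiling if it is read-locked."""
    return absolute_ceiling[obj] if mode == "write" else write_ceiling[obj]

def may_lock(txn, priority, held, write_ceiling, absolute_ceiling):
    """Grant a lock only if the requester's priority is higher than the
    highest rw-priority ceiling among objects locked by other transactions.

    held is a list of (object, holder, mode) for locks currently granted.
    """
    ceilings = [rw_ceiling(o, m, write_ceiling, absolute_ceiling)
                for o, holder, m in held if holder != txn]
    return not ceilings or priority > max(ceilings)

# Hypothetical setup: priorities T1=3, T2=2, T3=1 (larger is higher).
# T1 and T3 write O1; T2 writes O2; T1 and T2 read O3; T3 writes O3.
WRITE_CEILING = {"O1": 3, "O2": 2, "O3": 1}
ABSOLUTE_CEILING = {"O1": 3, "O2": 2, "O3": 3}

held = [("O1", "T3", "write")]
t2_while_o1_locked = may_lock("T2", 2, held, WRITE_CEILING, ABSOLUTE_CEILING)
t2_when_free = may_lock("T2", 2, [], WRITE_CEILING, ABSOLUTE_CEILING)

held = [("O3", "T2", "read")]
t1_reads_o3 = may_lock("T1", 3, held, WRITE_CEILING, ABSOLUTE_CEILING)
t3_writes_o3 = may_lock("T3", 1, held, WRITE_CEILING, ABSOLUTE_CEILING)
```

Note that the test never asks which object is being requested: T2 is denied O2 while T3 holds O1 simply because its priority does not exceed O1's ceiling.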

Under this protocol, it is not necessary to check
for the possibility of read-write conflicts. For instance,
when a data object is write-locked by a transaction, the
rw-priority ceiling is equal to the priority of the highest
priority transaction that can access it. Hence, the protocol will
block a higher priority transaction that may write or
read it. On the other hand, when the data object is
read-locked, the rw-priority ceiling is equal to the priority of the
highest priority transaction that may write it. Hence, a
transaction that attempts to write it will have a priority
no higher than the rw-priority ceiling and will be
blocked. Only transactions that read it and have
priority higher than the rw-priority ceiling will be
allowed to read-lock it, since read-locks are
compatible.

5.  Performance  Evaluation 

Various statistics have been collected for
comparing the performance of the priority-ceiling protocol
with other synchronization control algorithms.
Transactions are generated with exponentially distributed
interarrival times, and the data objects updated by a
transaction are chosen uniformly from the database. A
transaction has an execution profile which alternates
data access requests with equal computation requests,
and some processing requirement for termination
(either commit or abort). Thus the total processing time
of a transaction is directly related to the number of data
objects accessed. Due to space considerations, we
cannot present all our results but have selected the graphs
which best illustrate the difference and performance of
the algorithms. For example, we have omitted the
results of an experiment that varied the size of the
database, and thus the number of conflicts, because they
only confirm, and do not increase, the knowledge yielded
by other experiments.

For each experiment and for each algorithm
tested, we collected performance statistics and
averaged over the 10 runs. The percentage of
deadline-missing transactions is calculated with the following
equation: %missed = 100 * (number of deadline-missing
transactions / number of transactions processed).
A transaction is processed if either it executes
completely or it is aborted. We assume that all the
transactions are hard in the sense that there will be no
value in completing the transaction after its deadline.
Transactions that miss the deadline are aborted, and
disappear from the system immediately with some
abort cost. We have used the transaction size (the
number of data objects a transaction needs to access) as
one of the key variables in the experiments. It varies
from a small fraction up to a relatively large portion
(10%) of the database so that conflicts would occur
frequently. The high conflict rate allows synchronization
protocols to play a significant role in the system
performance. We choose the arrival rate so that protocols are
tested in a heavily loaded rather than lightly loaded
system. This is because, in designing real-time systems, one
must consider high load situations. Even though they
may not arise frequently, one would like to have a
system that misses as few deadlines as possible when such
peaks occur. In other words, when a crisis occurs and
the database system is under pressure is precisely when
making a few extra deadlines could be most important
[Abb88].

We normalize the transaction throughput in
records accessed per second for successful transactions,
not in transactions per second, in order to account for
the fact that bigger transactions need more database
processing. The normalized rate is obtained by
multiplying the transaction completion rate
(transactions/second) by the transaction size (database
records accessed/transaction). In Figure 1, the
throughput of the priority-ceiling protocol (C), the
two-phase locking protocol with priority mode (P), and
the two-phase locking protocol without priority mode
(L), is shown for transactions of different sizes with
balanced workload and I/O bound workload.
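
The two performance measures defined in this section, the percentage of deadline-missing transactions and the normalized throughput, can be written out as a short Python sketch. The function and parameter names are hypothetical, and the sample figures are made up for illustration; they are not experimental results.

```python
def percent_missed(missed, processed):
    """%missed = 100 * (deadline-missing transactions / transactions processed).

    A transaction counts as processed if it either completes or is aborted.
    """
    return 100.0 * missed / processed

def normalized_throughput(completed, elapsed_seconds, size):
    """Records accessed per second for successful transactions:
    completion rate (transactions/second) times transaction size
    (records accessed/transaction)."""
    return (completed / elapsed_seconds) * size

# Made-up figures: 200 processed, 5 missed; 30 completions in 10 s, size 8.
missed_pct = percent_missed(5, 200)
throughput = normalized_throughput(30, 10.0, 8)
```

Multiplying by the transaction size means that a run completing fewer but larger transactions is not penalized relative to one completing many small ones.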

As the transaction size increases, there is little
impact on the throughput of the priority-ceiling protocol
over the range of transaction sizes and over the
workload types shown in Figure 1. This is because in
the priority-ceiling protocol, the conflict rate is determined
by ceiling blocking rather than direct blocking, and the
frequency of ceiling blocking is not sensitive to the
transaction size.

However, the performance of the two-phase
locking protocol, with or without priority, degrades very
rapidly. This phenomenon is clearer as the
transaction workload is closer to I/O bound, since there are
few conflicts for the small transactions in the
two-phase locking protocol, and the concurrency is fully
achieved under the assumption of parallel I/O processing.
Poor performance of the two-phase locking protocol
for bigger transactions is due to the high conflict rate.

Since I/O cost is one of the key parameters in
determining performance, we have investigated an
approach to improve system performance by performing
I/O operations before locking. This is called
intention I/O. In the intention mode of I/O operation,
the system pre-fetches data objects that are in the
access lists of submitted transactions, without locking
them. This approach reduces the time for which data
objects are locked, resulting in higher throughput. As shown
in Figure 2, intention I/O improves the throughput of both
the two-phase locking and the ceiling protocols. However,
the improvement in the ceiling protocol is much more
significant. This is because the frequency of ceiling
blocking is very sensitive to the duration of data object
locking in the system.

Another important performance statistic is the
percentage of deadline-missing transactions, since the
synchronization protocol in real-time database systems
must satisfy the timing constraints of individual
transactions. In our experiments, each transaction’s deadline is
set proportional to its size and the system workload
(number of transactions), and the transaction with the
earliest deadline is assigned the highest priority. As
shown in Figure 3, the percentage of deadline-missing
transactions increases sharply for the two-phase locking
protocol as the transaction size increases. A sharp
rise was expected, since the probability of deadlocks
goes up with the fourth power of the transaction
size [Gray81]. In the priority-ceiling protocol, however,
the percentage of deadline-missing transactions increases
much more slowly as the transaction size increases,
since there is no deadlock in the priority-ceiling protocol
and the response time is proportional to the transaction
size and the priority ranking.
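
The deadline and priority assignment described above can be sketched as follows. This is only a minimal illustration: the text says deadlines are proportional to transaction size and workload but does not give the formula, so the product form, the `slack_per_record` constant, the arrival-time offset, and all names here are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    name: str
    arrival: float         # arrival time, arbitrary time units (assumed)
    size: int              # number of records accessed
    deadline: float = 0.0  # filled in by assign_deadlines


def assign_deadlines(transactions, slack_per_record=1.0):
    """Set each deadline proportional to transaction size and system workload."""
    workload = len(transactions)  # number of transactions in the system
    for txn in transactions:
        txn.deadline = txn.arrival + slack_per_record * txn.size * workload
    return transactions


def priority_order(transactions):
    """Earliest deadline first: the first element has the highest priority."""
    return sorted(transactions, key=lambda txn: txn.deadline)


batch = assign_deadlines([
    Transaction("T1", arrival=0.0, size=10),
    Transaction("T2", arrival=1.0, size=2),
    Transaction("T3", arrival=2.0, size=5),
])
ordered_names = [txn.name for txn in priority_order(batch)]
```

With these hypothetical values the small transaction T2 receives the earliest deadline and therefore the highest priority.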

6.  Conclusions 

Prototyping large software systems is not a new
approach. However, methodologies for developing a
prototyping environment for distributed database systems
have not been investigated in depth in spite of their
potential benefits. In this paper, we have presented a
prototyping environment that has been developed based
on a message-based approach with modular building
blocks. Although the complexity of a distributed database
system makes prototyping difficult, the implementation
has proven satisfactory for experimentation with
design choices, different database techniques and
protocols, and even an integrated evaluation of database
systems. It supports a very flexible user interface to
allow a wide range of system configurations and workload
characteristics. The expressive power and performance
evaluation capability of our prototyping
environment have been demonstrated by implementing a
distributed real-time database system and investigating
its performance characteristics.

There are many technical issues associated with
priority-based synchronization protocols that need
further investigation. For example, the analytic study
of the priority ceiling protocol provides an interesting
observation that the use of read and write semantics of
a lock may lead to worse performance in terms of
schedulability than the use of exclusive semantics of a
lock. This means that the read semantics of a lock cannot
be used to allow several readers to hold the lock on
the data object, and the ownership of locks must be
mutually exclusive. Is this necessarily true? We are
investigating this and other related issues using the
prototyping environment.

1. Introduction

The  need  for  a  recovery  mechanism  in  a  database  system  is  well  understood.  In  spite  of  powerful  database 
integrity  checking  mechanisms  which  detect  errors  and  undesirable  data,  it  is  possible  that  some  erroneous  data  may 
be included in the database. Furthermore, even with a perfect integrity checking mechanism, failures of hardware
and/or  software  at  the  processing  sites  may  destroy  consistency  of  the  database.  In  order  to  cope  with  those  errors 
and  failures,  database  systems  provide  recovery  mechanisms,  and  checkpointing  is  a  technique  frequently  used  in 
database  recovery  mechanisms. 

The goal of checkpointing in database systems is to read and return current values of the data objects in the
system. A checkpointing procedure is most useful if the states it returns are guaranteed to be consistent. In a
bank database, for example, a checkpoint can be used to audit all of the account balances (or the sum of all account
balances). It can also be used for failure detection; if a checkpoint produces an inconsistent system state, one
assumes that an error has occurred and takes appropriate recovery measures. In case of a failure, previous
checkpoints can be used to restore the database. Checkpointing must be performed so as to minimize both the costs of
performing checkpoints and the costs of recovering the database. If the checkpoint intervals are very short, too
much time and resources are spent in checkpointing; if these intervals are long, too much time is spent in recovery.

For a checkpoint process to return a meaningful result (e.g., a consistent state), the individual read steps of the
checkpoint must not be permitted to interleave with the steps of other transactions; otherwise an inconsistent state
can be returned even for a correctly operating system. However, since checkpointing is performed during normal
operation of the system, this requirement of non-interference will result in poor performance. For example, in order
to generate a commit-consistent checkpoint for recovery, user transactions may suffer a long delay waiting for active
transactions to complete and the updates to be reflected in the database [CHA85]. A transaction is said to be
reflected in the database if the values of data objects represent the updates made by the transaction. It is highly
desirable that transactions are executed in the system concurrently with the checkpointing process. In distributed
systems, the desirable properties of non-interference and global consistency make checkpointing more complicated
because we need to consider coordination among autonomous sites of the system.

Recently, the possibility of having a checkpointing mechanism that does not interfere with transaction
processing, and yet achieves consistency of the checkpoints, has been studied [CHA85, FIS82, SON86b]. The
motivation for non-interfering checkpointing is to improve system availability, that is, the system must be able to execute
user transactions concurrently with the checkpointing process. The principle behind non-interfering checkpointing
mechanisms is to create a diverged computation of the system such that the checkpointing process can view a
consistent state that could result by running to completion all of the transactions that are in progress when the
checkpoint begins, instead of viewing a consistent state that actually occurs by suspending further transaction execution.
Figure 1 shows a diverged computation during checkpointing.

Fig. 1. Diverged computation for checkpointing

Non-interfering checkpointing mechanisms, however, may suffer from the fact that the diverged computation
needs to be maintained by the system until all of the transactions that are in progress when the checkpoint begins
come to completion. This may not be a major concern for a database system in which all the transactions are
relatively short. However, for database systems with many long-lived transactions, checkpointing of this kind might not
be practical for the following reasons:

(1) It takes a long time to complete a non-interfering checkpoint, resulting in high storage and processing
overhead.

(2) If a crash occurs before the results of a long-lived transaction are included in the checkpoint, the system
must re-execute the transaction from the beginning, wasting all the resources used for the initial execution of
the transaction.

In the rest of this paper, we briefly discuss one approach for checkpointing which efficiently generates a
consistent database state, and its adaptation for systems with long-lived transactions. Given our space limitations, our
objective is to intuitively explain this approach and not to provide details. The details are given in separate papers
[SON86b, SON88].

2.  Non-interfering  Approach 

In order to make each checkpoint consistent, updates of a transaction must either be included in the
checkpoint completely or not at all. To achieve this, transactions are divided into two groups according to their relations
to the current checkpoint: after-checkpoint transactions (ACPT) and before-checkpoint transactions (BCPT).
Updates belonging to BCPT are included in the current checkpoint while those belonging to ACPT are not included.
In a centralized database system, it is an easy task to separate transactions for this purpose. However, it is not easy
in a distributed environment. To separate transactions in a distributed environment, a special timestamp which is
globally agreed upon by the participating sites is used. This special timestamp is called the Global Checkpoint
Number (GCPN), and it is determined as the maximum of the Local Checkpoint Numbers (LCPN) through
coordination of all participating sites.

An ACPT can be reclassified as a BCPT if its timestamp requires that the transaction must be executed before
the current checkpoint. This is called the conversion of transactions. The updates of a converted transaction are
included in the current checkpoint.

Two types of processes are involved in the checkpoint execution: checkpoint coordinator (CC) and
checkpoint subordinate (CS). The checkpoint coordinator starts and terminates the global checkpointing process. Once a
checkpoint has started, the coordinator does not issue the next checkpoint request until the first one has terminated.
At each site, the checkpoint subordinate performs local checkpointing by a request from the coordinator. We assume
that site m has a local clock LC_m which is manipulated by the clock rules of Lamport [LAM78].

Execution of a checkpoint progresses as follows. First, the checkpoint coordinator broadcasts a Checkpoint
Request Message with a timestamp LC_CC. The local checkpoint number of the coordinator is set to LC_CC. The
coordinator sets the Boolean variable CONVERT to false, and marks all transactions at the coordinator site with
timestamps not greater than LCPN_CC as BCPT.
On receiving a Checkpoint Request Message, the local clock of site m is updated and LCPN_m is set to LC_m.
The checkpoint subordinate of site m replies to the coordinator with LCPN_m, and sets the Boolean variable
CONVERT to false. The coordinator then broadcasts the GCPN, which is determined as the maximum of the local
checkpoint numbers.
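
The first phase of this exchange can be sketched in Python as follows. This is an illustrative sketch only: the function names are hypothetical, messages are modelled as plain function calls, and the receive rule `max(local, timestamp) + 1` is one common formulation of Lamport's clock rules, assumed here rather than quoted from the algorithm.

```python
def on_checkpoint_request(local_clock, request_timestamp):
    """Update a subordinate's clock on receipt (assumed Lamport receive rule).

    The updated clock value is also the subordinate's LCPN.
    """
    return max(local_clock, request_timestamp) + 1


def first_phase(coordinator_clock, subordinate_clocks):
    """Return (LCPN of the coordinator, LCPN per subordinate, GCPN)."""
    lcpn_cc = coordinator_clock  # timestamp carried by the request message
    lcpns = {
        site: on_checkpoint_request(clock, lcpn_cc)
        for site, clock in subordinate_clocks.items()
    }
    gcpn = max([lcpn_cc, *lcpns.values()])  # maximum of all local checkpoint numbers
    return lcpn_cc, lcpns, gcpn


# Hypothetical clock readings for one coordinator and two subordinates.
lcpn_cc, lcpns, gcpn = first_phase(5, {"CS1": 3, "CS2": 8})
```

In this sketch the coordinator is assumed to collect every reply before computing the GCPN; site failures are not modelled.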

In all sites, after the LCPN is fixed, all transactions with timestamps greater than the LCPN are marked as
temporary ACPTs. If a temporary ACPT updates any data objects, those data objects are copied from the database
to the buffer space of the transaction. When a temporary ACPT commits, updated data objects are not stored in the
database as usual, but are maintained as committed temporary versions (CTVs) of the data objects. The data manager
in each site maintains permanent and temporary versions of data objects. When a read request is made for a data
object which has committed temporary versions, the value of the latest committed temporary version is returned.
When a write request is made for a data object which has committed temporary versions, another committed
temporary version is created for it rather than overwriting the previous committed temporary version.

When the GCPN is known, each checkpointing process compares the timestamps of the temporary ACPTs
with the GCPN. Transactions that satisfy the following condition become BCPTs; their updates are reflected in the
database, and are included in the current checkpoint.

LCPN < timestamp(T) ≤ GCPN

The remaining temporary ACPTs are actual ACPTs; their updates are not included in the current checkpoint. These
updates are included in the database after the current checkpointing has been completed. After all eligible
temporary ACPTs have been converted to BCPTs, the checkpointing process sets the Boolean variable CONVERT to true.
Local checkpointing is executed by saving the state of data objects when there is no active BCPT and the variable
CONVERT is true. After the execution of local checkpointing, the values of the latest committed temporary versions
are used to replace the values of data objects in the database. Then, all committed temporary versions are deleted.
Execution sequences of two different types of transactions are shown in Figure 2.
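
The handling of committed temporary versions can be pictured with a small in-memory sketch. The class and method names below are hypothetical, the database is modelled as a Python dictionary, and the caller is assumed to invoke `local_checkpoint` only when there is no active BCPT and CONVERT is true.

```python
class DataManager:
    """Keeps permanent values plus committed temporary versions (CTVs)."""

    def __init__(self, database):
        self.database = dict(database)  # permanent versions of data objects
        self.ctv = {}                   # object name -> list of CTVs, oldest first

    def commit_temporary(self, updates):
        """Commit an ACPT: add a new CTV per object, never overwrite an old one."""
        for name, value in updates.items():
            self.ctv.setdefault(name, []).append(value)

    def read(self, name):
        """Return the latest CTV if one exists, otherwise the permanent value."""
        versions = self.ctv.get(name)
        return versions[-1] if versions else self.database[name]

    def local_checkpoint(self):
        """Save the permanent state, then reflect and delete the CTVs."""
        saved_state = dict(self.database)
        for name, versions in self.ctv.items():
            self.database[name] = versions[-1]
        self.ctv.clear()
        return saved_state


manager = DataManager({"x": 1, "y": 2})
manager.commit_temporary({"x": 10})  # first ACPT commits
manager.commit_temporary({"x": 11})  # second ACPT commits
value_seen_by_reader = manager.read("x")
checkpoint = manager.local_checkpoint()
```

The saved checkpoint holds only the permanent values, while readers see the latest committed temporary version until the versions are reflected in the database.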

As an example, consider a three-site distributed database system. Assume that LC_CC = 5, LC_CS1 = 3, and
LC_CS2 = 8. The CC sets its LCPN as 5, and broadcasts a checkpoint request message. On receiving the request message,
the LCPN of each CS is set to 6 and 9, respectively. After another round of message exchange, the GCPN of the current
checkpoint will be set to 9 by the CC and will be known to each CS. If transaction T_i with the timestamp 7 was
initiated at the site of CS1, it is treated as an ACPT. All updates by T_i are maintained as CTVs. However, when the GCPN
is known, T_i will be converted to a BCPT and its updates will be included in the current checkpoint.
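
The classification used in this example can be written as a short function. The function name and the returned labels are illustrative assumptions; the comparisons follow the condition LCPN < timestamp(T) ≤ GCPN given earlier.

```python
def classify(timestamp, lcpn, gcpn):
    """Classify a transaction at a site once the GCPN is known."""
    if timestamp <= lcpn:
        return "BCPT"            # already before the local checkpoint number
    if timestamp <= gcpn:
        return "converted BCPT"  # temporary ACPT with LCPN < timestamp <= GCPN
    return "ACPT"                # remains an after-checkpoint transaction


# The transaction with timestamp 7 at the site whose LCPN is 6, with GCPN 9.
status = classify(7, lcpn=6, gcpn=9)
```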

Fig. 2. Execution sequences of ACPT and BCPT

3. Adaptive Approach for Long-lived Transactions

It can be shown that a non-interfering checkpointing process will terminate in a finite time by selecting an
appropriate concurrency control mechanism [SON87]. However, the amount of time necessary to complete one
checkpoint cannot be bounded in advance; it depends on the execution time of the longest transaction classified as a
BCPT. Therefore the storage and processing cost of the checkpointing algorithm may become unacceptably high if a
long-lived transaction is included in the set of BCPTs. We briefly discuss the practicality of non-interfering
checkpoints in the next section. In addition, all resources used for the execution of a long-lived transaction would be
wasted if the transaction must be re-executed from the beginning due to a system failure.

These problems can be solved by using an adaptive checkpointing approach. We assume that each transaction
carries a flag that tells whether it is a normal transaction or a long-lived transaction. The threshold
separating the two types of transactions is application-dependent. In general, transactions that need hours of execution can
be considered long-lived transactions.

An adaptive checkpointing procedure operates in two different modes: global mode and local mode. The
global mode of operation is basically the procedure sketched in the previous section. In the local mode of operation, a
mechanism is provided to save consistent states of a transaction so that the transaction can resume execution from
its most recent checkpoint.

As  in  the  previous  approach,  the  checkpoint  coordinator  begins  checkpointing  by  sending  out  Checkpoint 
Request  Messages.  Upon  receiving  this  request  message,  each  site  checks  whether  any  long-lived  transaction  is 
being  executed  at  the  site.  If  so,  the  site  reports  it  to  the  coordinator,  instead  of  sending  its  LCPN.  Otherwise  (i.e.,  no 
long-lived  transaction  in  the  system),  non-interfering  checkpointing  begins.  If  any  site  reports  the  existence  of  a 
long-lived  transaction,  the  coordinator  switches  to  the  local  mode  of  operation,  and  informs  each  site  to  operate  in 
the  local  mode.  The  checkpoint  coordinator  sends  Checkpoint  Request  Messages  to  each  site  at  an  appropriate  time 
interval  to  initiate  the  next  checkpoint  in  the  global  mode.  This  attempt  will  succeed  if  there  is  no  active  long-lived 
transaction  in  the  system. 
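
The coordinator's choice between the two modes can be sketched as follows. The `LONG_LIVED` marker, the function name, and the representation of replies as a dictionary are assumptions for illustration; in the global case the sketch also returns the GCPN as the maximum of the reported checkpoint numbers.

```python
LONG_LIVED = "LONG_LIVED"  # hypothetical reply sent instead of an LCPN


def handle_replies(coordinator_lcpn, replies):
    """Return (mode, GCPN or None) from the subordinates' replies.

    replies maps a site name to either its LCPN (an int) or LONG_LIVED.
    """
    if LONG_LIVED in replies.values():
        return "local", None  # at least one site runs a long-lived transaction
    return "global", max([coordinator_lcpn, *replies.values()])


mode_with_long_txn = handle_replies(5, {"CS1": 6, "CS2": LONG_LIVED})
mode_without = handle_replies(5, {"CS1": 6, "CS2": 9})
```

A coordinator built this way would simply call the function again at the next checkpoint interval, and the attempt returns to global mode once no site reports a long-lived transaction.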

In the local mode of operation, each long-lived transaction is checkpointed separately from other long-lived
transactions. The coordinator of the long-lived transaction initiates the checkpoint by sending Checkpoint Request
Messages to its participants. A checkpoint at each site saves the local state of a long-lived transaction. To satisfy
the correctness requirement, a set of checkpoints, one per participating site of a global long-lived transaction,
should reflect a consistent state of the transaction. An inconsistent set of checkpoints may result from a
non-synchronized execution of associated checkpoints. For example, consider a long-lived transaction T being executed
at sites P and Q, and a checkpoint taken at site P at time X, and at site Q at time Y. If a message M is sent from P
after X, and received at Q before Y, then the checkpoints would save the reception of M but not the sending of M,
resulting in a checkpoint representing an inconsistent state of T.
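
The inconsistency in this example can be detected mechanically. The sketch below is illustrative only: it assumes all times are expressed on a single common time axis, and the function name and tuple layout are hypothetical.

```python
def inconsistent_messages(checkpoint_time, messages):
    """Return messages whose receipt is checkpointed but whose sending is not.

    checkpoint_time maps a site to the time of its local checkpoint.
    messages is a list of (sender, send_time, receiver, receive_time).
    """
    return [
        (sender, sent, receiver, received)
        for sender, sent, receiver, received in messages
        if sent > checkpoint_time[sender] and received < checkpoint_time[receiver]
    ]


# Site P checkpoints at X = 10 and site Q at Y = 20 (hypothetical values).
times = {"P": 10, "Q": 20}
traffic = [
    ("P", 12, "Q", 15),  # sent after X, received before Y: inconsistent
    ("P", 5, "Q", 8),    # sent and received before both checkpoints
    ("Q", 21, "P", 25),  # sent and received after both checkpoints
]
bad = inconsistent_messages(times, traffic)
```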

We  use  message  numbers  to  achieve  consistency  in  a  set  of  local  checkpoints  of  a  long-lived  transaction. 
Messages  that  are  exchanged  by  participating  transaction  managers  of  a  long-lived  transaction  contain  message 
number tags. Transaction managers of a long-lived transaction use monotonically increasing numbers in the tags of
their outgoing messages, and each maintains the tag number of the latest message it received from the other participants.
On  receiving  a  checkpoint  request,  a  participant  compares  the  message  number  attached  to  the  request  message  with 
the  last  tag  number  it  received  from  the  coordinator.  The  participant  replies  OK  to  the  coordinator  and  executes 
local  checkpointing  only  if  the  request  tag  number  is  not  less  than  the  number  it  has  maintained.  Otherwise,  it 
reports  to  the  coordinator  that  the  checkpoint  cannot  be  executed  with  that  request  message. 

If all replies from the participants arrive and are all OK, the coordinator decides to make all local checkpoints
permanent. Otherwise, the decision is to discard the current checkpoint, and to initiate a new checkpoint. This
decision is delivered to all participants. After a new permanent checkpoint is taken, any previous checkpoints will be
discarded at each site.
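
The tag comparison and the coordinator's decision can be sketched together. The class, method names, and decision strings below are hypothetical, and a missing reply is modelled simply as a `False` entry in the list of replies.

```python
class Participant:
    """Tracks the latest message tag received from the coordinator."""

    def __init__(self):
        self.last_tag_from_coordinator = 0

    def on_message(self, tag):
        """Record the tag number of an ordinary message from the coordinator."""
        self.last_tag_from_coordinator = max(self.last_tag_from_coordinator, tag)

    def on_checkpoint_request(self, request_tag):
        """Reply OK (True) only if the request tag is not less than the stored tag."""
        return request_tag >= self.last_tag_from_coordinator


def decide(replies):
    """Coordinator decision: permanent only if every participant replied OK."""
    if replies and all(replies):
        return "make permanent"
    return "discard and retry"


participant = Participant()
participant.on_message(3)
ok_reply = participant.on_checkpoint_request(4)     # request sent after message 3
stale_reply = participant.on_checkpoint_request(2)  # request predates message 3
```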

4.  Performance  Considerations 

There are two performance measures that can be used in discussing the practicality of non-interfering
checkpointing: extra storage and extra workload required. The extra storage requirement of the algorithm is simply the
CTV file size, which is a function of the expected number of ACPTs of the site, the number of data objects updated
by a typical transaction, and the size of the basic unit of information:

CTV file size = N_A × (number of updates) × (size of the data object)

where N_A is the expected number of ACPTs of the site.
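
As a quick numerical illustration of this estimate, the sketch below evaluates the product for hypothetical parameter values. The function name and the sample numbers are assumptions and do not come from any measurement.

```python
def ctv_file_size(expected_acpts, updates_per_transaction, object_size_bytes):
    """Estimate the CTV file size in bytes as the product of the three factors."""
    return expected_acpts * updates_per_transaction * object_size_bytes


# Hypothetical site: 20 expected ACPTs, 5 updates each, 512-byte data objects.
coarse = ctv_file_size(20, 5, 512)
# Halving the data object size halves the estimate.
fine = ctv_file_size(20, 5, 256)
```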

The CTV file may become unacceptably large if N_A or the number of updates becomes very large.
Unfortunately, they are determined dynamically from the characteristics of transactions submitted to the database system,
and hence cannot be controlled. Since N_A is proportional to the execution time of the longest BCPT at the site, it
would become unacceptably large if a long-lived transaction is being executed when a checkpoint begins at the site.
The only parameter we can change in order to reduce the CTV file size is the granularity of a data object. The size
of the CTV file can be minimized if we minimize the size of the data object. By doing so, however, the overhead of
normal transaction processing (e.g., locking and unlocking, deadlock detection, etc.) will be increased. Also, there is
a trade-off between the degree of concurrency and the lock granularity [RIE79]. Therefore the granularity of a data
object should be determined carefully by considering all such trade-offs, and we cannot minimize the size of the
CTV file by simply minimizing the data object granularity.

There is no extra storage requirement in intrusive checkpointing mechanisms [DAD80, KUS82, SCH80].
However, this property is balanced by the cases in which the system must block the execution of an ACPT or abort
transactions because of the checkpointing process.

The extra workload imposed by the algorithm mainly consists of the workload for (1) determining the GCPN,
(2) committing ACPTs (moving data objects to the CTV file), (3) reflecting the CTV file (moving committed temporary
versions from the CTV file to the database), and (4) clearing the CTV file when the reflect operation is finished.
Among these, the workload for (2) and (3) dominates the total extra workload. As in the estimation of extra storage,
the workload for (2) and (3) is determined by the number of ACPTs and the number of updates. Therefore, as long
as the values of these variables can be maintained below a certain threshold level, non-interfering checkpointing
would not severely degrade the performance of the system. A detailed discussion of the practicality of
non-interfering checkpointing is given in [SON86b].

5.  Site  Failures 

So far, we have assumed that no failure occurs during checkpointing. This assumption can be justified if the
probability of failures during a single checkpoint is extremely small. However, this is not always the case, and we now
consider how to make the algorithm resilient to failures.

During the global mode of operation, the checkpointing process is insensitive to failures of subordinates. If a
subordinate fails before the broadcast of a Checkpoint Request Message, it is excluded from the next checkpoint. If
a subordinate does not send its LCPN to the coordinator, it is excluded from the current checkpoint. When the site
recovers, the recovery manager of the site must determine the GCPN of the latest checkpoint. After receiving
information about transactions which must be executed for recovery, the recovery manager brings the database up to date
by executing all transactions whose timestamps are not greater than the latest GCPN. Other transactions are
executed after the state of the data objects at the site is saved by the checkpointing process.

An  atomic  commit  protocol  guarantees  that  a  transaction  is  aborted  if  any  participant  fails  before  it  sends  a 
Precommit  message  to  the  coordinator.  Therefore,  site  failures  during  the  execution  of  the  algorithm  cannot  affect 
the consistency of checkpoints because each checkpoint reflects only the updates of committed BCPTs.

In  the  local  mode  of  operation,  the  failure  of  a  participant  prevents  the  coordinator  from  receiving  OKs  from 
all  participants,  or  prevents  the  participants  from  receiving  the  decision  message  from  the  coordinator.  However, 
because  a  transaction  is  aborted  by  an  atomic  commit  protocol,  it  is  not  necessary  to  make  checkpointing  robust  to 
failures  of  participants. 

The  algorithm  is,  however,  sensitive  to  failures  of  the  coordinator.  In  particular,  if  the  coordinator  crashes 
during  the  first  phase  of  the  global  mode  of  operation  (i.e.,  before  the  GCPN  message  is  sent  to  subordinates),  every 
transaction  becomes  an  ACPT,  requiring  too  much  storage  for  committed  temporary  versions. 

One possible solution to this involves the use of a number of backup processes; these are processes that can
assume  responsibility  for  completing  the  coordinator’s  activity  in  the  event  of  its  failure.  These  backup  processes 
are  in  fact  checkpointing  subordinates.  If  the  coordinator  fails  before  it  broadcasts  the  GCPN  message,  one  of  the 
backups  takes  control.  A  similar  mechanism  is  used  in  SDD-1  [HAM80]  for  reliable  commitment  of  transactions. 

6.  Recovery 

A  recovery  from  site  crashes  is  called  a  site  recovery.  The  complexity  of  a  site  recovery  varies  in  distributed 
database systems according to the failure situation [SCH80]. If the crashed site has no replicated data objects and if
all  recovery  information  is  available  at  the  crashed  site,  local  recovery  is  sufficient.  Global  recovery  is  necessary 
because  of  failures  which  require  the  global  database  to  be  restored  to  some  earlier  consistent  state.  For  instance,  if 
the  transaction  log  is  partially  destroyed  at  the  crashed  site,  local  recovery  cannot  be  executed  to  completion. 

When a global recovery is required, the database system has two alternatives: a fast recovery and a complete
recovery. A fast recovery is a simple restoration of the latest global checkpoint. Since each checkpoint is globally
consistent, the restored state of the database is assured to be consistent. However, all transactions committed during
the time interval from the latest checkpoint to the time of the crash would be lost. A complete recovery is performed to
redo as many of those transactions as possible. The trade-off between the two recovery methods is between
the recovery time and the number of transactions saved by the recovery.

Quick  recovery  from  failures  is  critical  for  some  applications  of  distributed  database  systems  which  require 
high  availability  (e.g.,  ballistic  missile  defense  or  air  traffic  control).  For  those  applications,  the  fate  of  the  mission, 
or even the lives of human beings, may depend on the correct values of the data and its accessibility. Availability
of a consistent state is of primary concern for those applications, not the most up-to-date consistent state. If a
simple  restoration  of  the  latest  checkpoint  could  bring  the  database  to  a  consistent  state,  it  may  not  be  worthwhile  to 
spend  time  in  recovery  by  executing  a  complete  recovery  to  recover  some  of  the  transactions. 

For the applications in which each committed transaction is so important that the most up-to-date consistent
state of the database is highly desirable, or if the checkpoint intervals are so large that many transactions cannot
be recovered by a fast recovery, a complete recovery is appropriate. The cost of a complete recovery is the
increased recovery time, which reduces availability of the database. Searching through the transaction log is
necessary for a complete recovery. The property that each checkpoint reflects all updates of transactions with earlier
timestamps than its GCPN is useful in reducing the amount of searching, because the set of transactions whose
updates must be redone can be determined by a simple comparison of the timestamps of transactions with the GCPN
of the checkpoint. Complete recovery mechanisms based on the special timestamp of checkpoints (e.g., GCPN)
have been proposed in [KUS82, SON86a].
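
The timestamp comparison that selects the transactions to redo can be sketched as follows. The log format (a list of timestamp and transaction identifier pairs for committed transactions) and the function name are assumptions for illustration; transactions with timestamps not greater than the GCPN are taken to be reflected in the checkpoint already.

```python
def transactions_to_redo(log, gcpn):
    """Return the committed transactions not covered by the checkpoint.

    log is an iterable of (timestamp, transaction_id) pairs; the result is
    ordered by timestamp so the transactions can be redone in that order.
    """
    return [txn_id for timestamp, txn_id in sorted(log) if timestamp > gcpn]


# Hypothetical log contents after a crash, with the latest GCPN equal to 9.
log = [(12, "T3"), (7, "T1"), (9, "T2"), (15, "T4")]
redo = transactions_to_redo(log, gcpn=9)
```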

After site recovery is completed using either a fast recovery procedure or a complete recovery procedure, the
recovering site checks whether it has completed local-mode checkpointing for any long-lived transactions. If any
local-mode checkpoint is found, those transactions can be restarted from the saved checkpoints. In this case, the
coordinator of the transaction requests all participants to restart from their checkpoints if and only if they all are able
to restart from that checkpoint. The coordinator decides whether to restart the transaction from the checkpoint or
from the beginning based on responses from the participants, and sends the decision message to all participants.
Such a two-phase recovery protocol is necessary to maintain consistency of the database in case of damaged
checkpoints at the failure site. A transaction will be restarted from the beginning if any participant is not able to restore
the checkpointed state of the transaction for any reason.

7.  Concluding  Remarks 

During normal operation, checkpointing is performed to save information for recovery from failure. For better
recoverability and availability of distributed databases, checkpointing must allow construction of a globally
consistent database state without interfering with transaction processing. Site autonomy in distributed database systems
makes checkpointing more complicated than in centralized systems.

The role of the checkpointing coordinator is simply that of getting a uniformly agreed GCPN. Apart from this
function the coordinator is not essential to the operation of the proposed algorithm. If a uniformly agreed GCPN can
be made known to individual sites, then the centralized nature of the coordinator can be eliminated. One way to
achieve this is to preassign the clock values at which checkpoints will be taken. For example, we may take
checkpoints at clock values that are multiples of 1000. Whenever the local clock of a site crosses such a multiple,
checkpointing can begin.
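
A site can detect such a crossing by comparing two successive clock readings. The sketch below is a minimal illustration with a hypothetical function name; it treats the crossed multiple as the preassigned GCPN and, if a single clock jump passes several multiples, reports only the most recent one.

```python
def crossed_checkpoint(previous_clock, current_clock, interval=1000):
    """Return the preassigned GCPN crossed between two readings, or None."""
    if current_clock // interval > previous_clock // interval:
        return (current_clock // interval) * interval
    return None


first = crossed_checkpoint(998, 1003)    # the clock passed 1000
second = crossed_checkpoint(1003, 1500)  # no multiple of 1000 crossed
third = crossed_checkpoint(1999, 2000)   # the clock reached 2000 exactly
```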

If the frequency of checkpointing is related to load conditions and not necessarily to clock values, then the
preassigned GCPN will not work as well. In this case a node will have to assume the role of the checkpointing
coordinator to initiate the checkpoint. A unique node has to be identified as the coordinator. This may be achieved by
using solutions to the mutual exclusion problem [RIC81] and making the selection of the coordinator a critical
section activity.

The properties of global consistency and non-interference of checkpointing result in some overhead and
slow transaction processing during checkpointing. For applications where continuous processing is
so essential that the blocking of transaction processing for checkpointing is not feasible, we believe that a
non-interfering approach provides a practical solution to the problem of checkpointing and recovery in distributed
database systems.